Repeated values are stored when copied to data frame - r

I have a data frame like x,
> x
x1 x2 x3 x4
1 3 5 7
3 4 7 2
1 7 8 7
2 3 7 4
I want to change each row based on some calculations. The resulting rows are not all the same length. Say I want to copy in a row of length 2:
y <- c(1,2)
x[1,] <- y
Then the values are recycled to fill the whole row of x:
> x
x1 x2 x3 x4
1 2 1 2
3 4 7 2
1 7 8 7
2 3 7 4
But my output should be,
> x
x1 x2 x3 x4
1 2 NA NA
3 4 7 2
1 7 8 7
2 3 7 4
How to do this?

You could pad the NAs based on the number of columns of 'x' by assigning length of 'y' to ncol(x). If the length of 'y' is less than ncol(x), it will pad the additional elements with NA.
x[1,] <- `length<-`(y, ncol(x))
x
# x1 x2 x3 x4
#1 1 2 NA NA
#2 3 4 7 2
#3 1 7 8 7
#4 2 3 7 4
For easier understanding: this is equivalent to the two-step process @mpalanco mentioned in the comments, i.e. first we set length(y) to length(x) (or ncol(x); for a 'data.frame', length and ncol are the same) to pad with NAs, and then replace the first row of 'x' with 'y'.
length(y) <- length(x)
x[1,] <- y
data
x <- structure(list(x1 = c(1L, 3L, 1L, 2L), x2 = c(3L, 4L, 7L, 3L),
x3 = c(5L, 7L, 8L, 7L), x4 = c(7L, 2L, 7L, 4L)), .Names = c("x1",
"x2", "x3", "x4"), class = "data.frame", row.names = c(NA, -4L))

No clever solution came to mind, so here is a small function which pads y with NA. Using the padded y gives the intended behavior.
pad_with_NA <- function(y, dim_x) {
  if (length(y) < dim_x) {
    y <- c(y, rep(NA, dim_x - length(y)))
  }
  y
}
x[1,] <- pad_with_NA(y, dim(x)[2])
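Both answers pad y up to ncol(x) before assigning, which is what stops R from recycling the short vector. As a rough sketch only, the same idea can be wrapped in a helper that also refuses vectors that are too long. The name assign_row_padded is hypothetical and not from either answer, and the error on over-long input is my own assumption about what you would want.

```r
# Hypothetical helper (not from the answers): pad `values` with NA up to
# ncol(df) and write it into row i; stop if `values` is too long to fit.
assign_row_padded <- function(df, i, values) {
  if (length(values) > ncol(df)) {
    stop("more values than columns")
  }
  length(values) <- ncol(df)  # extending the length pads with NA
  df[i, ] <- values
  df
}

x <- data.frame(x1 = c(1L, 3L, 1L, 2L), x2 = c(3L, 4L, 7L, 3L),
                x3 = c(5L, 7L, 8L, 7L), x4 = c(7L, 2L, 7L, 4L))
x <- assign_row_padded(x, 1, c(1, 2))
x
```

The first row should now hold 1, 2, NA, NA, while the other rows keep their values.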

Related

How to change values across 1 row based on values in a column in R?

I have a lot of columns in 1 dataframe that identify different timepoints of the same variable. Basically, within my data, if there's no response at timepoint X-1, there will be no response at time point X or beyond (after an NA appears in a row, it will continue). I currently have a column that shows which row the last response came from and what that response is. The dataframe currently looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 4 NA NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 5 NA NA 5 X2
6 6 5 7 7 7 7 X4
My goal is to be able to conduct a regression using the last response of each row as the outcome variable. However, I don't want it to repeat twice in the "X_final" column and also in the column that the response actually comes from. Therefore, I am hoping to find a way to put a "." in for the cell where that value originally came from so it looks like this:
id X1 X2 X3 X4 X_final X_final_location
1 1 5 5 6 NA 6 X3
2 2 . <NA> NA NA 4 X1
3 3 7 1 3 5 5 X4
4 4 8 2 4 2 2 X4
5 5 1 . NA NA 5 X2
6 6 5 7 7 7 7 X4
Any suggestions would be appreciated - thank you!
Another method, since you already have the locations in $X_final_location. As mentioned in the question comments, NA values would be preferred if the goal would be regression analysis to preserve numeric values.
data_orig <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
X1 = c(5, 4, 7, 8, 1, 5),
X2 = c(5, NA, 1, 2, 5, 7),
X3 = c(6, NA, 3, 4, NA, 7),
X4 = c(NA, NA, 5, 2, NA, 7),
X_final = c(6, 4, 5, 2, 5, 7),
X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)
data_new <- data_orig
for (i in seq_len(nrow(data_new))) {
data_new[i, data_new$X_final_location[i]] <- NA
}
data_new
# id X1 X2 X3 X4 X_final X_final_location
# 1 1 5 5 NA NA 6 X3
# 2 2 NA NA NA NA 4 X1
# 3 3 7 1 3 NA 5 X4
# 4 4 8 2 4 NA 2 X4
# 5 5 1 NA NA NA 5 X2
# 6 6 5 7 7 NA 7 X4
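If the loop becomes a bottleneck on a larger data frame, one possible alternative is matrix indexing: build a two-column (row, column) index and blank all the cells in one assignment. This is only a sketch under the assumption that X1 to X4 are all numeric, so they can pass through a matrix safely. The names dat, cols and m are mine.

```r
dat <- data.frame(
  id = 1:6,
  X1 = c(5, 4, 7, 8, 1, 5),
  X2 = c(5, NA, 1, 2, 5, 7),
  X3 = c(6, NA, 3, 4, NA, 7),
  X4 = c(NA, NA, 5, 2, NA, 7),
  X_final = c(6, 4, 5, 2, 5, 7),
  X_final_location = c("X3", "X1", "X4", "X4", "X2", "X4")
)

cols <- c("X1", "X2", "X3", "X4")
m <- as.matrix(dat[, cols])

# One (row, column) pair per data row: the column is looked up by name
m[cbind(seq_len(nrow(dat)), match(dat$X_final_location, cols))] <- NA

dat[, cols] <- m
dat
```

The id, X_final and X_final_location columns are never touched, so their types are preserved.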
One way to do this (using NA instead of . to preserve the data type):
match finds the position of the first NA, and replace sets the value at that position - 1 (the previous one) to NA.
apply(data, 1, function(x) ...) applies that function to each row. Finally, t transposes the result (since apply returns the per-row results as columns).
data = data.frame(id = 1:6, X1 = c(5L, 4L, 7L, 8L, 1L, 5L), X2 = c(5L,
NA, 1L, 2L, 5L, 7L), X3 = c(6L, NA, 3L, 4L, NA, 7L), X4 = c(NA,
NA, 5L, 2L, NA, 7L), X_final = c(6L, 4L, 5L, 2L, 5L, 7L), X_final_location = c("X3",
"X1", "X4", "X4", "X2", "X4"))
data[,2:5] <- t(apply(data[,2:5], 1 , function(x) replace(x, match(NA, x) - 1, NA)))
data
#> id X1 X2 X3 X4 X_final X_final_location
#> 1 1 5 5 NA NA 6 X3
#> 2 2 NA NA NA NA 4 X1
#> 3 3 7 1 3 5 5 X4
#> 4 4 8 2 4 2 2 X4
#> 5 5 1 NA NA NA 5 X2
#> 6 6 5 7 7 7 7 X4
Another way using split (grouping by row):
split(data, row.names(data)) <-
lapply(split(data, row.names(data)), \(x) replace(x, x$X_final_location, "."))

Using which function to transpose parts of columns under condition

Suppose we have the following data:
X Y
6
1
2
2
1 1
8
3
4
1
1 2
I want to convert it to:
X Y Y-1 Y-2 Y-3
6
1
2
2
1 1 2 2 1
8
3
4
1
1 2 1 4 3
That is: for rows with X=1 - take 3 previous Y values and append them to this row.
I "brute-forced" it with a loop:
namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""
for(i in 1:nrow(mydata)){
if(mydata$X[i] != ""){mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}
}
But it was too slow for my dataset of ~300k points - about 10 minutes.
Then I found a post with a similar question, where they proposed the which function. That reduced the time to a tolerable 1-2 minutes:
namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""
trials_rows <- which(mydata$X != "")
for (i in trials_rows) {mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}
But considering that which itself takes less than a second, I believe I can somehow combine which with some kind of transpose function. I just can't get my mind around it.
I have a big data frame (~300k rows), and ~6k rows have this "X" value.
Is there a fast and simple way to do it, instead of iterating through the results of the which function?
You can do this with a single assignment using some vectorised trickery:
mydata[trials_rows, namevector] <- mydata$Y[trials_rows - rep(1:3,each=length(trials_rows))]
mydata
# X Y Y-1 Y-2 Y-3
#1 NA 6
#2 NA 1
#3 NA 2
#4 NA 2
#5 1 1 2 2 1
#6 NA 8
#7 NA 3
#8 NA 4
#9 NA 1
#10 1 2 1 4 3
Basically, take each row in trials_rows, look backwards three rows using a vectorised subtraction, and then overwrite the combination of trials_rows in rows and namevector in columns.
Reproducible example used here:
mydata <- structure(list(X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L),
Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-10L))
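To make the index arithmetic easier to follow, here is a sketch of the same idea using outer, which builds the whole matrix of "row minus lag" positions explicitly. The names lag_cols and idx are mine, the rows are selected with !is.na(X) because X is NA rather than "" in the reproducible example, and the guard for trial rows that have fewer than three rows above them is an assumption about the desired behaviour (they get NA).

```r
mydata <- data.frame(X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L),
                     Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L))

lag_cols <- c("Y-1", "Y-2", "Y-3")
trials_rows <- which(!is.na(mydata$X))

# idx[r, k] is the row number k steps above the r-th trial row
idx <- outer(trials_rows, 1:3, "-")
idx[idx < 1] <- NA  # assumption: positions before the first row become NA

# Look up all lagged Y values at once, keeping the rows x lags shape
lagged <- matrix(mydata$Y[idx], nrow = length(trials_rows))

mydata[lag_cols] <- NA_integer_
mydata[trials_rows, lag_cols] <- lagged
mydata
```

Non-trial rows hold NA here instead of an empty string, which keeps the new columns integer.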

Sum many rows with some of them have NA in all needed columns

I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on the question "Sum of two Columns of Data Frame with NA Values":
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=TRUE)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
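If apply turns out to be slow on a large data frame, a possible vectorised variant is to keep rowSums and then overwrite the rows that contain no observed values. This is a sketch only. The names cols, row_total and all_missing are mine.

```r
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))

cols <- c("a", "b", "c")

# Sum ignoring NAs; an all-NA row comes out as 0 at this point
row_total <- rowSums(df[cols], na.rm = TRUE)

# Rows with zero non-missing values should be NA, not 0
all_missing <- rowSums(!is.na(df[cols])) == 0
row_total[all_missing] <- NA

df$sum <- row_total
df
```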

Subset data based on conditions in R

I have multiple inputs like:
a <- x y z
1 2 2
2 3 2
3 2 4
4 2 4
5 2 1
b <- c(1,2)
c <- c(2,3)
I want to subset this data so that, for each i, I keep the rows where a$x is greater than or equal to b[i] and less than or equal to c[i].
output should look like:
d <- x y z
1 2 2
2 3 2
2 3 2
3 2 4
I have tried this:
d = as.data.frame(matrix(ncol=3, nrow=0))
names(d) = names(a)
for (i in seq_along(b)) {
d <- rbind(d,a[which(a$x>=b[i] & a$x<=c[i]),])
}
Using the dplyr::filter function:
library(dplyr)
sub_list <- lapply(seq_along(b), function(i) a %>% filter(x >= b[i] & x <= c[i]))
do.call(rbind, sub_list)
x y z
1 1 2 2
2 2 3 2
3 2 3 2
4 3 2 4
Input data:
a <- structure(list(x = 1:5, y = c(2L, 3L, 2L, 2L, 2L), z = c(2L,
2L, 4L, 4L, 1L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA,
-5L))
b <- c(1,2)
c <- c(2,3)
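For comparison, a base R sketch of the same idea without dplyr: Map walks over the lower and upper bounds in parallel, and rbind stacks the pieces. I have renamed the bounds to lower and upper (my names, not the asker's) to avoid shadowing the c function.

```r
a <- data.frame(x = 1:5,
                y = c(2L, 3L, 2L, 2L, 2L),
                z = c(2L, 2L, 4L, 4L, 1L))
lower <- c(1, 2)
upper <- c(2, 3)

# One subset per (lower, upper) pair, stacked in order
pieces <- Map(function(lo, hi) a[a$x >= lo & a$x <= hi, ], lower, upper)
d <- do.call(rbind, pieces)
rownames(d) <- NULL
d
```

Rows that fall in more than one range appear once per range, as in the expected output in the question.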

R split each row of a dataframe into two rows

I would like to split each row of a (numeric) data frame into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
After splitting each row, we should get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
The "value_1" and "value_2" columns are split off and their elements are placed in a new row. For example, value_1 = 22 and value_2 = 54 become a new row below the row they came from.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' by creating a column of rownames (setDT(df1, keep.rownames = TRUE)). Subset the columns 1:5 and 1, 6, 7 in a list, rbind the list element with fill = TRUE option to return NA for corresponding columns that are not found in one of the datasets, order by the row number ('rn') and assign (:=) the row number column to 'NULL'.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
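# rn is a character column, so convert it before ordering (otherwise "10" sorts before "2")
rbindlist(list(df1[, 1:5, with = FALSE], setnames(df1[, c(1, 6:7),
with = FALSE], 2:3, c("ID", "X"))), fill = TRUE)[order(as.integer(rn))][, rn:= NULL][]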
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A hadleyverse (tidyverse) version of the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(as.integer(rowname)) %>%
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Replicate your matrix
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each input row in turn.
# Put the first four elements into output row 2*i - 1,
# and the two value columns, padded with two NAs,
# into output row 2*i.
for(i in seq_len(nrow(input_df))){
output_df[2*i - 1,] <- input_df[i, c(1:4)]
output_df[2*i,] <- c(input_df[i, c(5:6)],NA,NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
ID X Y Z
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[,1:4]
data2 <- data[,5:6]
names(data2) <- names(data1)[1:ncol(data2)]
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.
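For what it is worth, the interleaving step can also be sketched in base R with a stable order on the original row number, which avoids both the loop and any extra packages. This is an untimed sketch, not a benchmarked solution. The names top, bottom and stacked are mine.

```r
df1 <- data.frame(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L),
                  Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L),
                  value_1 = c(22L, 52L, 2L, 23L),
                  value_2 = c(54L, 71L, 34L, 47L))

n <- nrow(df1)
top <- df1[, c("ID", "X", "Y", "Z")]
bottom <- data.frame(ID = df1$value_1, X = df1$value_2,
                     Y = NA_integer_, Z = NA_integer_)

# Stack both halves, then sort by source row number; order() keeps
# ties in their original order, so each top row precedes its bottom row
stacked <- rbind(top, bottom)
out <- stacked[order(rep(seq_len(n), 2)), ]
rownames(out) <- NULL
out
```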
