I want to build a matrix or data frame by choosing, for each row, the names of the columns whose element in that row is not NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
mutate(k = row_number()) %>%
gather(column, value, a, b, c) %>%
filter(!is.na(value)) %>%
group_by(k) %>%
summarise(temp_var = paste(column, collapse = " ")) %>%
separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds non-NA indices, sorts by rows order and then subsets the names of your zz data set according to the sorted index. You can wrap it into as.data.frame if you prefer it over a matrix.
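A row-wise variant of the same idea, as a minimal sketch rather than a replacement for the vectorized version: it assumes, as the question states, that every row has the same number of non-NA values, so that apply returns a matrix instead of a list. The names res and expected are only for illustration.

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# For each row, keep the names of the columns that are not NA.
# apply() stacks the per-row results as columns, so transpose at the end.
res <- t(apply(zz, 1, function(r) names(zz)[!is.na(r)]))

# What the question asks for
expected <- matrix(c("a", "c",
                     "b", "c",
                     "a", "b",
                     "a", "c"), ncol = 2, byrow = TRUE)
```

If the number of non-NA values differs between rows, apply returns a list and the t() call no longer gives a rectangular result.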
EDIT: transpose the data frame once before processing, so there is no need to transpose twice inside the loop as in the first version.
cols <- names(zz)
for (column in cols) {
zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame's structure, so the task of "removing the NAs in each row" has to build a new data frame row by row, since the values in each row can come from different columns of the original data frame.
In the first version, zz[1, ] is a one-row data frame; t converts it into a vector so that na.omit can be used, and the result is then transposed back into a row.
I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized within each column. The second one needs to be done row by row anyway.
EDIT: growing objects performs very badly in R. I knew I could use rbindlist from data.table, which takes a list of data frames, but the OP doesn't want new packages. My first attempt used plain rbind, which cannot take a list as input. Later I found that do.call is an alternative. It's still slower than rbindlist, though.
Related
I am trying to write a loop function that sums up consecutive columns of values in a table and outputs them into another table.
For example, in my original table, we have columns a, b, c, etc, which contain the same number of numeric values.
The resulting table then should be a, a+b, a+b+c, etc up to the last column of the original table
I have a feeling a for loop should be sufficient for this operation, but I can't get my head around the format and syntax.
Any help would be appreciated!
Since you're new, here is an example of a very minimal reproducible example:
library(data.table)
x = data.table(a=1:3,b=4:6,c=7:9)
for(... now what?
And here's a way to do your task:
library(data.table)
# make some dummy data
X = data.table(a=1:2,b=3:4,c=5:6)
# make an empty result table
Y = data.table()
# for i = 1 to the number of columns in X
for(i in 1:ncol(X)){
# colnames(X) is "a" "b" "c".
# colnames(X)[1:1] is "a", colnames(X)[1:2] is "a" "b", colnames(X)[1:3] is "a" "b" "c"
# paste0(colnames(X)[1:1],collapse='') is "a",
# paste0(colnames(X)[1:2],collapse='') is "ab",
# paste0(colnames(X)[1:3],collapse='') is "abc"
newcolname = paste0(colnames(X)[1:i],collapse='')
# Y[,(newcolname):= is data.table syntax to create a new column called newcolname
# X[,1:i] selects columns 1 to i
# rowSums calculates the, um, row sums :D
Y[,(newcolname):=rowSums(X[,1:i])]
}
Maybe you need Reduce like below
cbind(
df,
setNames(
as.data.frame(Reduce(`+`, df, accumulate = TRUE)),
Reduce(paste0, names(df), accumulate = TRUE)
)
)
such that
a b c a ab abc
1 1 4 7 1 5 12
2 2 5 8 2 7 15
3 3 6 9 3 9 18
Data
df <- structure(list(a = 1:3, b = 4:6, c = 7:9), class = "data.frame", row.names = c(NA,
-3L))
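As a further base R sketch of the same running total, cumsum can be applied across each row. This is an alternative to the Reduce call above, not something the answer itself proposes; it assumes all columns are numeric, and the name running is only for illustration.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# cumsum over each row gives a, a+b, a+b+c;
# apply() returns the rows as columns, so transpose back.
running <- as.data.frame(t(apply(as.matrix(df), 1, cumsum)))
```

The columns of running keep the original names a, b, c; they can be renamed with the Reduce(paste0, names(df), accumulate = TRUE) call shown above if names such as ab and abc are wanted.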
How can I sum one specific column in all data frames in a list and put the sums in a new data frame?
A small example is:
A <- data.frame(matrix( nrow = 2, ncol = 2))
B <- data.frame(matrix( nrow = 2, ncol = 2))
A[,] <- 3
B[,] <- 4
l <- list(A,B)
So let's say I want to sum up all columns "X1" in my list and put in one data frame (vector, since there only should be one row). This data frame should then have value 6 (3+3) in first row and 8 (4+4) in the second.
In the real data I have 18 data frames in the list, and the columns to sum in each data frame are of different lengths.
Maybe I should use the sapply or lapply function?
You can use colSums, i.e.
do.call(rbind, lapply(l, function(i)colSums(i['X1'])))
# X1
#[1,] 6
#[2,] 8
Here is one option with sapply, where we extract the column 'X1' into a matrix and then take the colSums
colSums(sapply(l, `[[`, 'X1'))
#[1] 6 8
Or with map from purrr
library(purrr)
library(dplyr)
map_dbl(l, ~ .x %>%
pull(X1) %>%
sum)
#[1] 6 8
If it is needed as a data.frame
map_dfr(l, ~ .x %>%
summarise(X1 = sum(X1)))
# X1
#1 6
#2 8
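The question mentions that the real columns have different lengths. In that case sapply with `[[` returns a list rather than a matrix, so colSums no longer applies. A minimal base R sketch that sums each column on its own, assuming every data frame has a numeric column named X1 (the third data frame C is made up here to show unequal lengths):

```r
A <- data.frame(matrix(nrow = 2, ncol = 2))
B <- data.frame(matrix(nrow = 2, ncol = 2))
C <- data.frame(matrix(nrow = 3, ncol = 2))  # hypothetical, different length
A[, ] <- 3
B[, ] <- 4
C[, ] <- 1
l <- list(A, B, C)

# One sum per data frame; vapply guarantees a numeric vector back
sums <- vapply(l, function(d) sum(d[["X1"]]), numeric(1))

# As a one-column data frame, one row per list element
sums_df <- data.frame(X1 = sums)
```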
I want to replace the values of one element of a list with the values of a second element of a list. Specifically,
I have a list containing multiple data sets.
Each data set has 2 variables
The variables are factors
The n'th element of the second variable of each data set needs to be replaced with the n'th element of the first variable in each data set
Also, the replaced value should be called "replaced"
dat1 <- data.frame(names1 =c("a", "b", "c", "f", "x"),values= c("val1_1", "val2_1", "val3_1", "val4_1", "val5_1"))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 =c("a", "b", "f2", "s5", "h"),values= c("val1_2", "val2_2", "val3_2", "val4_2", "val5_2"))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be the same list, but just with the 5th value replaced.
[[1]]
names1 values
1 a val1_1
2 b val2_1
3 c val3_1
4 f val4_1
5 replaced x
[[2]]
names1 values
1 a val1_2
2 b val2_2
3 f2 val3_2
4 s5 val4_2
5 replaced h
A base R approach using lapply. Since both columns are factors, we need to add the new levels before replacing the values; otherwise the replaced values would turn into NAs.
n <- 5
lapply(list1, function(x) {
levels(x$values) <- c(levels(x$values), as.character(x$names1[n]))
x$values[n] <- x$names1[n]
levels(x$names1) <- c(levels(x$names1), "replaced")
x$names1[n] <- "replaced"
x
})
#[[1]]
# names1 values
#1 a val1_1
#2 b val2_1
#3 c val3_1
#4 f val4_1
#5 replaced x
#[[2]]
# names1 values
#1 a val1_2
#2 b val2_2
#3 f2 val3_2
#4 s5 val4_2
#5 replaced h
There is also another approach: convert both columns to character, replace the values at the required position, and convert them back to factors. However, since every data frame in the list can be huge, we do not want to convert all the values to character and back to factor just to change one value, which could be computationally very expensive.
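For completeness, the character round trip described above could look roughly like this. It is a sketch for small data only, given the cost concern just mentioned; stringsAsFactors = TRUE is passed explicitly here as an assumption so that both columns start out as factors regardless of R version, and res is an illustrative name.

```r
dat1 <- data.frame(names1 = c("a", "b", "c", "f", "x"),
                   values = c("val1_1", "val2_1", "val3_1", "val4_1", "val5_1"),
                   stringsAsFactors = TRUE)
dat2 <- data.frame(names1 = c("a", "b", "f2", "s5", "h"),
                   values = c("val1_2", "val2_2", "val3_2", "val4_2", "val5_2"),
                   stringsAsFactors = TRUE)
list1 <- list(dat1, dat2)

n <- 5
res <- lapply(list1, function(x) {
  x[] <- lapply(x, as.character)   # factors -> character, keeps the data frame
  x$values[n] <- x$names1[n]       # copy the n-th name into values
  x$names1[n] <- "replaced"        # then mark the name as replaced
  x[] <- lapply(x, factor)         # character -> factor again
  x
})
```

Converting back with factor() also drops levels that are no longer used, which the level-appending version does not do.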
Here is one option with tidyverse. Loop through the list with map, slice the row of interest (in this case, it is the last row, so n() can be used), mutate the column value and bind with the original data without the last row
library(tidyverse)
map(list1, ~ .x %>%
slice(n()) %>%
mutate(values = names1, names1 = 'replaced') %>%
bind_rows(.x %>% slice(-n()), .))
#[[1]]
# names1 values
#1 a val1_1
#2 b val2_1
#3 c val3_1
#4 f val4_1
#5 replaced x
#[[2]]
# names1 values
#1 a val1_2
#2 b val2_2
#3 f2 val3_2
#4 s5 val4_2
#5 replaced h
Or it can be made more compact with fct_c from forcats. Different factor levels can be combined together with fct_c for the 'values' and 'names1' column
library(forcats)
map(list1, ~ .x %>%
mutate(values = fct_c(values[-n()], names1[n()]),
names1 = fct_c(names1[-n()], factor('replaced'))))
Or using a similar approach in base R, where we loop through the list with lapply, convert the data.frame to a matrix, rbind the matrix without its last row with the values of interest, and convert the result back to a data.frame (in R versions before 4.0.0 the default is stringsAsFactors = TRUE, so the columns become factors; from R 4.0.0 onward the default is FALSE, so pass stringsAsFactors = TRUE explicitly if factors are needed)
lapply(list1, function(x) as.data.frame(rbind(as.matrix(x)[-5, ],
c('replaced', as.character(x$names1[5])))))
I have a vector containing a combination of NA values and strings:
v <- c(NA, NA, "text", NA)
I also have a separate data frame:
df <- data.frame("Col1" = 1:4, "Col2" = 5:8)
Col1 Col2
1 5
2 6
3 7
4 8
My goal is to remove the rows of df where the corresponding v value is NA. So in this case the output would just be:
Col1 Col2
3 7
Since the third element of v is the only one that's not NA, only the third row of df is kept. I tried to accomplish this using a for loop:
for (i in 1:length(v)) {
if (is.na(v[i])) {
df <- df[-i, ]
}
}
However, for some reason this just outputs a version of df that includes only the 2nd and 4th rows:
Col1 Col2
2 6
4 8
I can't figure out why the loop isn't working. Any suggestions appreciated!
This will do it -
df[!is.na(v), ]
You don't need a loop. You can always subset any dataframe using a vector of row indices or logical vector (TRUE and FALSE). !is.na(v) generates a logical vector based on v and subsets the dataframe accordingly.
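As for why the loop in the question misbehaves: each df[-i, ] shortens the data frame, so later values of i no longer point at the rows they did originally. A small sketch that reproduces this next to the logical subset (looped and kept are illustrative names):

```r
v <- c(NA, NA, "text", NA)
df <- data.frame(Col1 = 1:4, Col2 = 5:8)

looped <- df
for (i in seq_along(v)) {
  if (is.na(v[i])) {
    # i = 1 drops original row 1; the remaining rows shift up.
    # i = 2 then drops what is now in position 2, i.e. original row 3.
    # i = 4 is past the end of the 2-row result, so nothing is dropped.
    looped <- looped[-i, ]
  }
}

# Indexing the untouched df once avoids the shifting positions
kept <- df[!is.na(v), ]
```

Here looped ends up with original rows 2 and 4, as reported in the question, while kept holds only row 3.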
I have a data frame that I have to initialize as an empty data frame.
Now I have only one column available, and I want to add it to the empty data frame. How can I do it? I will not know the length of the column in advance.
Example
df = data.frame(a= NA, b = NA, col1= NA)
....
nrow(col1) # Here I will know length of column, and I only have one column available here.
df$col1 <- col1
error is as follows:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c("1", :
replacement has 5 rows, data has 1
Any help would be greatly appreciated.
use cbind
df = data.frame(a= NA, b = NA)
col1 <- c(1,2,3,4,5)
df <- cbind(df, col1)
# a b col1
# 1 NA NA 1
# 2 NA NA 2
# 3 NA NA 3
# 4 NA NA 4
# 5 NA NA 5
After your edits, you can still use cbind, but you'll need to drop the existing column first (or handle the duplicate columns after the cbind)
cbind(df[, 1:2], col1)
## or if you don't know the column indices
## cbind(df[, !names(df) %in% c("col1")], col1)
A little workaround with lists:
l <- list(a=NA, b=NA, col1=NA)
col1 <- c(1,2,3)
l$col1 <- col1
df <- as.data.frame(l)
I like both answers provided by Symbolix and maRtin, but I have also written my own hack, which is as follows.
df[1:length(a),"a"] = a
However, I am not sure how efficient this method is in terms of time. What would its big-O time complexity be?
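This does not settle the big-O question, but one way to sidestep growing the data frame at all is to create it only once the column is known. A minimal sketch, assuming the placeholder columns a and b can simply be filled with NA (the values in col1 are made up for illustration):

```r
col1 <- c(1, 2, 3, 4, 5)  # hypothetical column whose length is only known here

# data.frame() recycles the length-one NA values to the length of col1,
# so the frame is allocated once at its final size.
df <- data.frame(a = NA, b = NA, col1 = col1)
```

Whether this fits depends on whether the data frame really has to exist before col1 does.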