New df with columns from different df of unequal length - r

I am trying to create a new df new_df with columns from different data frames.
The columns are of unequal length, which I presume can be solved by replacing empty 'cells' with NA? However, this is above my current skill level, so any help will be much appreciated!
Packages:
library(tidyverse)
library(ggplot2)
library(here)
library(readxl)
library(gt)
I want to create new_df with columns from the following subsets:
Kube_liten$Unit_cm
Kube_Stor$Unit_cm

You can try the following, which extends the "short" vector with NA values:
col1 <- 1:9
col2 <- 1:12
col1[setdiff(col2, col1)] <- NA  # positions 10:12 get NA, so col1 grows to length 12
data_comb <- data.frame(col1, col2)
# or
# data_comb <- cbind(col1, col2)
Output:
col1 col2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 NA 10
11 NA 11
12 NA 12
Since you didn't provide sample data or a desired output, I don't know if this will be the exact approach for your data.
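If your real columns are not simple integer sequences, a more general way (a sketch, not part of the answer above) is to grow the shorter vector to the length of the longer one with length<-, which pads the new positions with NA:
short <- c(2.5, 7.1, 3.3)
long  <- c(1, 4, 9, 16, 25)
length(short) <- length(long)  # short is now 2.5 7.1 3.3 NA NA
data.frame(short, long)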

Novice, we appreciate that you are new to R. But please study a few basics, in particular how vector recycling works.
Your problem:
vec1 <- c(1,2,3)
vec2 <- c("A","B","C","D","E")
df <- data.frame(var1 = vec1, var2 = vec2)
Error in data.frame(var1 = vec1, var2 = vec2) :
arguments imply differing number of rows: 3, 5
You may "glue" vectors together with cbind - check out the warning. The problem of different vector length is not gone.
df <- cbind(vec1, vec2)
Warning message:
In cbind(vec1, vec2) :
number of rows of result is not a multiple of vector length (arg 1)
What you get: vec1 is "recycled". R assumes you want to fill the missing places by repeating the values (which might not be what you want).
df
vec1 vec2
[1,] "1" "A"
[2,] "2" "B"
[3,] "3" "C"
[4,] "1" "D"
[5,] "2" "E"
## you can convert this to a data frame, if you prefer that object structure
df <- data.frame(cbind(vec1, vec2))
Warning message:
In cbind(vec1, vec2) :
number of rows of result is not a multiple of vector length (arg 1)
> df
vec1 vec2
1 1 A
2 2 B
3 3 C
4 1 D
5 2 E
So your approach of extending the shorter vectors with NA is valid (and possibly what you want).
Thus, you are on the right track:
determine the length of your longest vector
inject NAs where needed (mind you, you may not want them always at the end); a short sketch follows below
This problem can be found on Stackoverflow. Check out
How to cbind or rbind different lengths vectors without repeating the elements of the shorter vectors?
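A sketch of those two steps, applied to a whole list of vectors at once (the vectors below are invented for illustration):
vecs <- list(a = 1:3, b = letters[1:5], c = 10:11)
n <- max(lengths(vecs))                                    # length of the longest vector
padded <- lapply(vecs, function(v) { length(v) <- n; v })  # pad each vector with NA at the end
as.data.frame(padded)
#    a b  c
# 1  1 a 10
# 2  2 b 11
# 3  3 c NA
# 4 NA d NA
# 5 NA e NA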

Related

How to combine columns that have the same name and remove NA's?

Relatively new to R, but I have an issue combining columns that have the same name. I have a very large dataframe (~70 cols and 30k rows). Some of the columns have the same name. I wish to merge these columns and remove the NA's.
An example of what I would like is below (although on a much larger scale).
df <- data.frame(x = c(2,1,3,5,NA,12,"blah"),
                 x = c(NA,NA,NA,NA,9,NA,NA),
                 y = c(NA,5,12,"hop",NA,2,NA),
                 y = c(2,NA,NA,NA,8,NA,4),
                 z = c(9,5,NA,3,2,6,NA))
desired.result <- data.frame(x = c(2,1,3,5,9,12,"blah"),
                             y = c(2,5,12,"hop",8,2,4),
                             z = c(9,5,NA,3,2,6,NA))
I have tried a number of things including suggestions such as:
R: merging columns and the values if they have the same column name
Combine column to remove NA's
However, these solutions either require a numeric dataset (I need to keep the character information) or they require you to manually input the columns that are the same (which is too time consuming for the size of my dataset).
I have managed to solve the issue manually by creating new columns that are combinations:
df$x <- apply(df[,1:2], 1, function(x) x[!is.na(x)][1])
However I don't know how to get R to auto-identify where the columns have the same names and then apply something like the above such that I don't need to specify the index each time.
Thanks
Here is a base R approach:
# split into a named list, based on the column names before the "." character
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
#get the first non-na value for each row in each chunk
L2 <- lapply(L, function(x) apply(x, 1, function(y) na.omit(y)[1]))
# result in a data.frame
as.data.frame(L2)
# x y z
# 1 2 2 9
# 2 1 5 5
# 3 3 12 NA
# 4 5 hop 3
# 5 9 8 2
# 6 12 2 6
# 7 blah 4 NA
# since you are using mixed formats, the columns are not all of the same class!
str(as.data.frame(L2))
# 'data.frame': 7 obs. of 3 variables:
# $ x: chr "2" "1" "3" "5" ...
# $ y: chr " 2" "5" "12" "hop" ...
# $ z: num 9 5 NA 3 2 6 NA
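If you prefer the tidyverse, a rough alternative (not from the answer above, assuming the dplyr package is installed) is to coalesce each chunk of same-named columns; the columns are coerced to character first so that mixed-type chunks do not error, which again gives all-character output:
library(dplyr)
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
L2 <- lapply(L, function(chunk) do.call(coalesce, lapply(chunk, as.character)))
as.data.frame(L2)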

List indexed by number and some elements NULL, how to convert to data frame?

In an R program, the list length is unknown. It is generated from a for loop.
for example:
ls <- list()
ls[[1]] <- 5
ls[[3]] <- "a"
ls[[6]] <- 8
....
Some indices (ordinal numbers) are undefined.
I want to convert to data frame, such as follows:
1 5
2 NA
3 a
4 NA
5 NA
6 8
...
Additional question: how to get the ordinal number range of this list?
A base R approach could be as follows, assuming "ls" is already present in the environment.
Explanation:
We first iterate through all the elements using lapply. In the anonymous function, we check whether each element is NULL; wherever a NULL is found, we replace it with NA. Once the NULL values are replaced with NA, we bind the elements row-wise using 'rbind' via do.call. To get the index column, we can use either the seq function or the colon operator to create a sequence.
dfs <- data.frame(col1 = do.call('rbind', lapply(ls,
                    function(x) ifelse(is.null(x), NA, x))),
                  col2 = seq(1, length(ls)),
                  stringsAsFactors = FALSE)
Alternatively, using unlist (instead of do.call and rbind):
dfs <- data.frame(col1 = unlist(lapply(ls,
                    function(x) ifelse(is.null(x), NA, x))),
                  col2 = seq(1, length(ls)),
                  stringsAsFactors = FALSE)
Output:
> dfs
# col1 col2
# 1 3 1
# 2 NA 2
# 3 6 3
# 4 NA 4
# 5 NA 5
# 6 8 6
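For the additional question about the ordinal number range, a brief sketch (using the list built in the question): once the largest index has been assigned, length(ls) gives the highest ordinal number and seq_along(ls) gives the full range, including the NULL slots:
length(ls)     # 6, the largest index that was assigned
seq_along(ls)  # 1 2 3 4 5 6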

Why does class change from integer to character when indexing a data frame with a numeric matrix?

If I index a data.frame of all integers with a matrix, I get the expected result.
df1 <- data.frame(c1=1:4, c2=5:8)
df1
# c1 c2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
df1[matrix(c(1:4,1,2,1,2), nrow=4)]
# [1] 1 6 3 8
If the data.frame has a column of characters, the result is all characters, even though I'm only indexing the integer columns.
df2 <- data.frame(c0=letters[1:4], c1=1:4, c2=5:8)
df2
# c0 c1 c2
#1 a 1 5
#2 b 2 6
#3 c 3 7
#4 d 4 8
df2[matrix(c(1:4,2,3,2,3), nrow=4)]
# [1] "1" "6" "3" "8"
class(df2[matrix(c(1:4,2,3,2,3), nrow=4)])
# [1] "character"
df2[1,2]
# [1] 1
My best guess is that R does not bother to go back through the result and check whether the values all originated from columns of a single class. Can anyone please explain why this is happening?
In ?Extract it is described that indexing via a numeric matrix is intended for matrices and arrays. So it might be surprising that such indexing worked for a data frame in the first place.
However, if we look at the code for [.data.frame (getAnywhere(`[.data.frame`)), we see that when extracting elements from a data.frame using a matrix in i, the data.frame is first coerced to a matrix with as.matrix:
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1)
{
    # snip
    if (Narg < 3L) {
        # snip
        if (is.matrix(i))
            return(as.matrix(x)[i])
Then look at ?as.matrix:
"The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column".
Thus, because the first column in "df2" is of class character, as.matrix will coerce the entire data frame to a character matrix before the extraction takes place.
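If you need to keep the original column classes, one possible workaround (a sketch, not part of the answer above) is to skip the matrix-indexing shortcut and pull the elements out column by column:
idx <- matrix(c(1:4, 2, 3, 2, 3), nrow = 4)
mapply(function(r, c) df2[[c]][r], idx[, 1], idx[, 2])
# [1] 1 6 3 8   (integer, not character)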

In R, can out of bounds indexing return NAs on matrices, like it does on vectors?

I would like an out of bounds subscript on a matrix in R to return NAs instead of an error, like it does on vectors.
> a <- 1:3
> a[1:4]
[1] 1 2 3 NA
> b <- matrix(1:9, 3, 3)
> b
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> b[1:4, 1]
Error: subscript out of bounds
>
So I would have liked it to return:
[1] 1 2 3 NA
Right now I am doing this with ifelse tests to see if the index variables exist in the rownames, but on large data structures this is taking quite a bit of time. Here is an example:
s <- split(factors, factors$date)  # split so each date has its own list
names <- last(s)[[1]]$bond         # names of bonds that we want
cdmat <- sapply(names, function(n)
           sapply(s, function(x)
             if (n %in% x$bond) x[x$bond == n, column] else NA))
where factors is an xts with about 250 000 rows. So it's taking about 15 seconds and that's too long for my application.
The reason this is important is that each list element I am applying this to has a different length, but I need to output a matrix with equal length columns as a result of the sapply. I don't want another list out with different length elements.
Actually I have just realised that if I take the column I want and turn it into a vector, this works perfectly. So:
> b[, 1][1:4]
[1] 1 2 3 NA
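For whole matrices, a sketch of the same idea (not from the question): map any out-of-range row number to NA before subsetting, since indexing with an NA subscript returns NA values:
wanted <- 1:4
rows <- ifelse(wanted <= nrow(b), wanted, NA)
b[rows, ]
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
# [4,]   NA   NA   NA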

Average across Columns in R, excluding NAs

I can't imagine I'm the first person with this question, but I haven't found a solution yet (here or elsewhere).
I have a few columns, which I want to average in R. The only minimally tricky aspect is that some columns contain NAs.
For example:
Trait Col1 Col2 Col3
DF 23 NA 23
DG 2 2 2
DH NA 9 9
I want to create a Col4 that averages the entries in the first 3 columns, ignoring the NAs.
So:
Trait Col1 Col2 Col3 Col4
DF 23 NA 23 23
DG 2 2 2 2
DH NA 9 9 9
Ideally something like this would work:
data$Col4 <- mean(data$Col1, data$Col2, data$Col3, na.rm=TRUE)
but it doesn't: mean() only averages its first argument (a single vector), so the other columns are not included.
You want rowMeans() but importantly note it has a na.rm argument that you want to set to TRUE. E.g.:
> mat <- matrix(c(23,2,NA,NA,2,9,23,2,9), ncol = 3)
> mat
[,1] [,2] [,3]
[1,] 23 NA 23
[2,] 2 2 2
[3,] NA 9 9
> rowMeans(mat)
[1] NA 2 NA
> rowMeans(mat, na.rm = TRUE)
[1] 23 2 9
To match your example:
> dat <- data.frame(Trait = c("DF","DG","DH"), mat)
> names(dat) <- c("Trait", paste0("Col", 1:3))
> dat
Trait Col1 Col2 Col3
1 DF 23 NA 23
2 DG 2 2 2
3 DH NA 9 9
> dat <- transform(dat, Col4 = rowMeans(dat[,-1], na.rm = TRUE))
> dat
Trait Col1 Col2 Col3 Col4
1 DF 23 NA 23 23
2 DG 2 2 2 2
3 DH NA 9 9 9
Why NOT the accepted answer?
The accepted answer is correct; however, it is specific to this particular task and hard to generalize. What if we need, instead of the mean, another statistic such as var or skewness, or even a custom function?
A more flexible solution:
row_means <- apply(X = data[, -1], MARGIN = 1, FUN = mean, na.rm = TRUE)  # drop the non-numeric Trait column first
More details on apply:
Generally, to apply any function (custom or built-in) on the entire dataset, column-wise or row-wise, apply or one of its variants (sapply, lapply, ...) should be used. Its signature is:
apply(X, MARGIN, FUN, ...)
where:
X: The data of form dataframe or matrix.
MARGIN: The dimension on which the aggregation takes place. Use 1 for row-wise operation and 2 for column-wise operation.
FUN: The operation to be called on the data. Here any pre-defined R functions, as well as any user-defined function could be used.
...: additional arguments passed on to FUN; here na.rm = TRUE tells mean to drop NA values before averaging (see the example below).
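For example, applied to the data frame from the accepted answer (a sketch; the non-numeric Trait column is dropped before averaging):
row_means <- apply(X = dat[, c("Col1", "Col2", "Col3")], MARGIN = 1, FUN = mean, na.rm = TRUE)
row_means
# [1] 23  2  9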
Why should I use apply?
For many reasons, including but not limited to:
Any function can be easily plugged in to apply.
For different preferences such as the input or output data types, other variations can be used (e.g., lapply for operations on lists).
(Most importantly) It facilitates scalability, since there are variants that allow parallel execution (e.g. mclapply from the parallel package). For instance, see [+] or [+].
