Concatenate multiple data frames with different columns in R [duplicate]

I have multiple data frames (with different rows and columns) and I am trying to concatenate them into one. They have different numbers of columns, but shared columns have the same names. Simply:
> colnames(data1)
"A" "B" "C" "D" "E" "F" "G" "H"
> colnames(data2)
"A" "B" "C" "D"
> colnames(data3)
"A" "D" "E" "F" "H"
I need to concatenate all three data frames into one so that columns are matched by name, and wherever a data frame lacks a column, NA is inserted for that column. Thanks in advance.

Use dplyr::bind_rows:
data1 <- data.frame(a = 1:3)
data2 <- data.frame(a = 4:6, b = 7:9)
data3 <- data.frame(b = 11:13)
dplyr::bind_rows(data1, data2, data3)
#   a  b
#1  1 NA
#2  2 NA
#3  3 NA
#4  4  7
#5  5  8
#6  6  9
#7 NA 11
#8 NA 12
#9 NA 13
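For readers who want to see what the NA filling amounts to without dplyr, here is a minimal base R sketch. The helper name rbind_fill is hypothetical (it is not part of base R or dplyr), and this is only an illustration of the idea, not how bind_rows is implemented.

```r
# Hypothetical helper: pad each data frame with NA columns so that all of
# them share the same column names, then stack them with rbind().
rbind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (nm in setdiff(all_cols, names(d))) {
      d[[nm]] <- NA            # one NA column per missing name
    }
    d[all_cols]                # same column order in every data frame
  })
  do.call(rbind, padded)
}

data1 <- data.frame(a = 1:3)
data2 <- data.frame(a = 4:6, b = 7:9)
data3 <- data.frame(b = 11:13)
res <- rbind_fill(data1, data2, data3)
print(res)
```

For many or large data frames, dplyr::bind_rows is likely the more convenient choice; the sketch only makes the column matching explicit.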


extract vowels in a word and add values of vowels R programming Data science

I am new to programming, and we have been asked to do a project.
I have values of all letters (dataset1)
Letter value
a 1
b 2
c 3
d 4
.
.
.
.
Z 26
I have a list of many words (dataset2)
Wood
Table
Chair
Desk
I need to extract all the vowels from each word, add up the values of those vowels, and store the sum against the respective word in a separate column of dataset2.
Desired Output
Word Sum_of_vowel_value
Wood 30 (15+15)
Table 6 (1+5)
Chair 10 (9+1)
I am new to Stack Overflow. Please excuse any errors in this post.
Here is one crude approach in base R:
Split every character in the Word column of dataset2, keep only the vowels, match them against dataset1's Letter column to get the corresponding values, and sum them.
dataset2$Sum_of_vowel_value <- sapply(strsplit(as.character(dataset2$Word), ""),
function(x) sum(dataset1$value[match(vowel[match(tolower(x), vowel)],
dataset1$Letter)], na.rm = TRUE))
dataset2
# Word Sum_of_vowel_value
#1 Wood 30
#2 Table 6
#3 Chair 10
#4 Desk 5
To understand this better, we can break the function into steps.
We first split Word into separate characters
strsplit(as.character(dataset2$Word), "")
#[[1]]
#[1] "W" "o" "o" "d"
#[[2]]
#[1] "T" "a" "b" "l" "e"
#[[3]]
#[1] "C" "h" "a" "i" "r"
#[[4]]
#[1] "D" "e" "s" "k"
The next step is to keep only vowels.
sapply(strsplit(as.character(dataset2$Word), ""),
function(x) vowel[match(tolower(x), vowel)])
#[[1]]
#[1] NA "o" "o" NA
#[[2]]
#[1] NA "a" NA NA "e"
#[[3]]
#[1] NA NA "a" "i" NA
#[[4]]
#[1] NA "e" NA NA
Now for these vowels, we get corresponding value from dataset1
sapply(strsplit(as.character(dataset2$Word), ""),
function(x) dataset1$value[match(vowel[match(tolower(x), vowel)],
dataset1$Letter)])
#[[1]]
#[1] NA 15 15 NA
#[[2]]
#[1] NA 1 NA NA 5
#[[3]]
#[1] NA NA 1 9 NA
#[[4]]
#[1] NA 5 NA NA
Finally, we sum all these values to get the final output:
#[1] 30 6 10 5
data
vowel <- c('a', 'e', 'i', 'o', 'u')
dataset1 <- data.frame(Letter = letters, value = 1:26)
dataset2 <- structure(list(Word = structure(c(4L, 3L, 1L, 2L),
.Label = c("Chair", "Desk", "Table", "Wood"), class = "factor")),
row.names = c(NA, -4L), class = "data.frame")
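The nested match calls above can also be expressed with a named lookup vector. The following is a sketch under the same data assumptions as the answer (the name vowel_value is introduced here for illustration only); it relies on the fact that indexing a named vector by a name it does not contain yields NA.

```r
# Data as in the answer above, written out directly
dataset1 <- data.frame(Letter = letters, value = 1:26, stringsAsFactors = FALSE)
dataset2 <- data.frame(Word = c("Wood", "Table", "Chair", "Desk"),
                       stringsAsFactors = FALSE)
vowel <- c("a", "e", "i", "o", "u")

# Named lookup vector: names are the vowels, values are their scores
vowel_value <- setNames(dataset1$value, dataset1$Letter)[vowel]

# Indexing by name gives NA for every letter that is not a vowel,
# and sum(..., na.rm = TRUE) then ignores those NAs
chars <- strsplit(tolower(dataset2$Word), "")
dataset2$Sum_of_vowel_value <- vapply(
  chars, function(x) sum(vowel_value[x], na.rm = TRUE), numeric(1))
print(dataset2)
```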
If you haven't reached the apply functions in your course yet, perhaps you've reached for loops and regular expressions?
Extract the vowels using gsub by replacing non-vowels with an empty string:
dataset2$Vowels <- gsub("[^aeiou]", "", tolower(dataset2$Word))
Split the string vectors into individual letters.
vowels <- strsplit(dataset2$Vowels, "")
Initialise the scores
dataset2$Score <- 0
Use a for loop to count the values of the vowels using match.
for (i in seq_along(vowels)) {
  dataset2$Score[i] <- sum(dataset1$value[match(vowels[[i]], dataset1$Letter)], na.rm = TRUE)
}
dataset2
# Word Vowels Score
#1 Wood oo 30
#2 Table ae 6
#3 Chair ai 10
#4 Desk e 5
A for loop does basically the same thing as sapply, but it is more verbose and can be slower, particularly when it modifies a data frame on every iteration as above.
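To make the comparison with the loop concrete, here is a sketch of the same gsub-based approach with the initialisation and the loop folded into one vapply call. The data are rebuilt inline so the block runs on its own; vapply is used instead of sapply only because it checks that each result is a single number.

```r
dataset1 <- data.frame(Letter = letters, value = 1:26, stringsAsFactors = FALSE)
dataset2 <- data.frame(Word = c("Wood", "Table", "Chair", "Desk"),
                       stringsAsFactors = FALSE)

# Strip non-vowels, then split what is left into single letters
vowels <- strsplit(gsub("[^aeiou]", "", tolower(dataset2$Word)), "")

# One call replaces the initialisation of Score and the for loop
dataset2$Score <- vapply(
  vowels,
  function(v) sum(dataset1$value[match(v, dataset1$Letter)], na.rm = TRUE),
  numeric(1))
print(dataset2)
```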

Display identical columns in R dataframe

Suppose I have the following dataframe :
df <- data.frame(A=c(1,2,3),B=c("a","b","c"),C=c(2,1,3),D=c(1,2,3),E=c("a","b","c"),F=c(1,2,3))
> df
A B C D E F
1 1 a 2 1 a 1
2 2 b 1 2 b 2
3 3 c 3 3 c 3
I want to filter out the columns that are identical. I know that I can do it with
DuplCols <- df[duplicated(as.list(df))]
UniqueCols <- df[ ! duplicated(as.list(df))]
In the real world my data frame has more than 500 columns; I do not know how many identical columns of each kind there are, and I do not know the names of the columns. However, each column name is unique (as in df). My desired result is (optimally) a data frame in which each row stores the column names of one kind of identical columns. The number of columns in DesiredResult is the largest number of identical columns of one kind in the original data frame; where another kind has fewer identical columns, NA should be stored:
> DesiredResult
X1 X2 X3
1 A D F
2 B E NA
3 C NA NA
(With "identical column of the same kind" I mean the following: in df the columns A, D, F are identical columns of the same kind and B, E are identical columns of the same kind.)
You can use unique and then test with %in% where it matches to extract the colname.
tt <- lapply(unique(as.list(df)), function(x) {colnames(df)[as.list(df) %in% list(x)]})
tt
#[[1]]
#[1] "A" "D" "F"
#
#[[2]]
#[1] "B" "E"
#
#[[3]]
#[1] "C"
t(sapply(tt, "length<-", max(lengths(tt)))) # as a matrix; wrap in as.data.frame() for a data frame
# [,1] [,2] [,3]
#[1,] "A" "D" "F"
#[2,] "B" "E" NA
#[3,] "C" NA NA
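A variation on the same idea, sketched here as an assumption rather than as the answer's method: build one string key per column and group the column names by key. Pasting the values means that the number 1 and the string "1" would be treated as equal, which may or may not be acceptable for real data. The resulting columns are named V1, V2, V3 rather than X1, X2, X3.

```r
df <- data.frame(A = c(1, 2, 3), B = c("a", "b", "c"), C = c(2, 1, 3),
                 D = c(1, 2, 3), E = c("a", "b", "c"), F = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# One string key per column (assumption: pasted values identify a column)
key <- vapply(df, function(col) paste(col, collapse = "\r"), character(1))

# Columns with the same key share a group id, in order of first appearance
groups <- split(names(df), match(key, unique(key)))

# Pad every group with NA up to the size of the largest group
n <- max(lengths(groups))
DesiredResult <- as.data.frame(do.call(rbind, lapply(groups, `length<-`, n)),
                               stringsAsFactors = FALSE)
print(DesiredResult)
```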

R: coalescing a large data frame

Say I create a data frame, foo:
foo <- data.frame(A=rep(NA,10),B=rep(NA,10))
foo$A[1:3] <- "A"
foo$B[6:10] <- "B"
which looks like,
A B
1 A <NA>
2 A <NA>
3 A <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> B
7 <NA> B
8 <NA> B
9 <NA> B
10 <NA> B
I can coalesce this into a single column, like this:
data.frame(AB = coalesce(foo$A, foo$B))
giving,
AB
1 A
2 A
3 A
4 <NA>
5 <NA>
6 B
7 B
8 B
9 B
10 B
which is nice. Now, say my data frame is huge with lots of columns. How do I coalesce that without naming each column individually? As far as I understand, coalesce is expecting vectors, so I don't see a neat and tidy dplyr solution where I can just pluck out the required columns and pass them en masse. Any ideas?
EDIT
As requested, a "harder" example.
foo <- data.frame(A=rep(NA,10),B=rep(NA,10),C=rep(NA,10),D=rep(NA,10),E=rep(NA,10),F=rep(NA,10),G=rep(NA,10),H=rep(NA,10),I=rep(NA,10),J=rep(NA,10))
foo$A[1] <- "A"
foo$B[2] <- "B"
foo$C[3] <- "C"
foo$D[4] <- "D"
foo$E[5] <- "E"
foo$F[6] <- "F"
foo$G[7] <- "G"
foo$H[8] <- "H"
foo$I[9] <- "I"
foo$J[10] <- "J"
How do I coalesce this without having to write:
data.frame(ALL= coalesce(foo$A, foo$B, foo$C, foo$D, foo$E, foo$F, foo$G, foo$H, foo$I, foo$J))
You can use do.call(coalesce, foo), which passes each column of foo to coalesce as a separate argument, so you don't have to write them all out:
library(dplyr)
do.call(coalesce, foo)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
You can also splice the columns into the call with the !!! operator (see also the purrr documentation for pmap):
coalesce(!!!foo)
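If dplyr is not available, the same left-to-right "first non-NA value" logic can be sketched in base R with Reduce. The helper coalesce2 is a hypothetical name introduced here; it is not dplyr's coalesce and makes no attempt to match its type checking.

```r
foo <- data.frame(A = rep(NA, 10), B = rep(NA, 10), C = rep(NA, 10))
foo$A[1:3] <- "A"
foo$B[4:6] <- "B"
foo$C[9:10] <- "C"

# Hypothetical two-vector helper: keep x where present, else fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Reduce() folds the helper over the columns from left to right
ALL <- Reduce(coalesce2, foo)
print(data.frame(ALL = ALL))
```

Rows 7 and 8 stay NA because no column has a value there.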

Sort dataframe with multiple columns for multiple years

I have a data.frame with multiple columns, the first being Year. I want to sort the columns in descending order of their values for each year. I have fifteen years of data and over 3000 columns.
I illustrate as follows:
Year A B  C  D
2000 2 3  4 NA
2001 3 4 NA  1
Desired output (my data frame has NAs as well, but I cannot remove those):
Year C B A
2000 4 3 2
Year B A D
2001 4 3 1
And this version as well:
Year
2000 C B A
2001 B A D
I have scripted this code
Asc <- order(df[-1], decreasing=TRUE)
But I'm unable to obtain my desired output. I have referred to "in R sort row data in ascending order", but it is still different from what I'm looking for.
I would appreciate your help in this regard.
We can use apply with MARGIN=1. We loop through the rows of the dataset (excluding the first column) with apply, get the index of non-NA elements ('i1'), order the non-NA values descendingly ('i2'), and use that to rearrange the column names of the dataset.
m1 <- t(apply(df1[-1], 1, function(x) {
i1 <- !is.na(x)
i2 <- order(-x[i1])
names(df1)[-1][i1][i2]}))
m1
# [,1] [,2] [,3]
#[1,] "C" "B" "A"
#[2,] "B" "A" "D"
If we need the values as well as the names, a list approach would be more suitable, as it won't create any problems with the class.
lst <- apply(df1[-1], 1, function(x){
i1 <- !is.na(x)
list(sort(x[i1],decreasing=TRUE))})
lst
#[[1]]
#[[1]][[1]]
#C B A
#4 3 2
#[[2]]
#[[2]][[1]]
#B A D
#4 3 1
We can extract the names or the elements from the 'lst'
do.call(rbind, do.call(`c`,rapply(lst, names,
how='list')))
# [,1] [,2] [,3]
#[1,] "C" "B" "A"
#[2,] "B" "A" "D"
Or
t(sapply(do.call(c, lst), names))
and the values as
t(simplify2array(do.call(c, lst)))
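The answer's df1 is not defined in the post, so the following self-contained sketch rebuilds it from the question's sample and attaches the ranked column names to Year. It assumes, as in the sample, that every row has the same number of non-NA values; otherwise apply returns a list and the t() step no longer works.

```r
# Sample data from the question
df1 <- data.frame(Year = c(2000, 2001),
                  A = c(2, 3), B = c(3, 4), C = c(4, NA), D = c(NA, 1))

# For each row: drop NAs, sort descending, keep the column names
ranked <- t(apply(df1[-1], 1, function(x) {
  x <- x[!is.na(x)]
  names(sort(x, decreasing = TRUE))
}))

# Put Year back in front of the ranked names
out <- data.frame(Year = df1$Year, ranked, stringsAsFactors = FALSE)
print(out)
```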

creating a data frame with two columns each preserving the class characteristics of each assembled vector [closed]

I have two vectors: one is a character vector and the other a numeric vector. I am trying to assemble both in a data frame while preserving their class properties; however, none of the methods below seem to work, as either all values appear in quotes or none do. How can I create a data frame where the vector element properties are preserved?
x <- c(1,2,3,4,5,6,7,8)
y <- c("a","b","c","d","e","f","g","h")
cbind(x,y)
as.data.frame(x,y)
as.data.frame(cbind(x,y))
EDIT - Desired Output
x y
1 "a"
2 "b"
3 "c"
4 "d"
5 "e"
6 "f"
7 "g"
8 "h"
Following several comments, please note that the quotes still do not appear:
sapply(data.frame(x,y, stringsAsFactors=FALSE), class)
x y
"numeric" "character"
data.frame(x,y, stringsAsFactors=FALSE)
x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
x
[1] 1 2 3 4 5 6 7 8
y
[1] "a" "b" "c" "d" "e" "f" "g" "h"
Your character vector is automatically converted to a factor (the default behaviour of data.frame in R versions before 4.0.0).
data.frame accepts an argument, stringsAsFactors, that stops character strings from being turned into factors:
df<-data.frame(x,y,stringsAsFactors=FALSE)
Edit: In light of the clarification in the question by the OP:
This adds the required quote marks, using the escape character, so that the values of y print with quotes:
> x <- c(1,2,3,4,5,6,7,8)
> y <- c("a","b","c","d","e","f","g","h")
> df<-data.frame(x,paste0("\"",y,"\""),stringsAsFactors = FALSE)
> names(df)<-c("x","y")
> print(df, row.names = FALSE)
x y
1 "a"
2 "b"
3 "c"
4 "d"
5 "e"
6 "f"
7 "g"
8 "h"
> sapply(df, class)
x y
"numeric" "character"
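One possible refinement, sketched here as an assumption about what the OP wants: keep the original data frame untouched and add the quote marks only to a copy used for display, so the stored strings do not carry literal quote characters. The name shown is introduced for illustration.

```r
x <- c(1, 2, 3)
y <- c("a", "b", "c")
df <- data.frame(x, y, stringsAsFactors = FALSE)

# Work on a copy so the stored values keep their original form
shown <- df
is_chr <- vapply(shown, is.character, logical(1))

# Wrap every character column's values in double quotes for display only
shown[is_chr] <- lapply(shown[is_chr], function(col) sprintf('"%s"', col))

print(shown, row.names = FALSE)
print(sapply(df, class))
```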
