I have multiple csv files, and these files contain some identical columns as well as different columns.
For example,
#1st.csv
col1,col2
1,2
#2nd.csv
col1,col3,col4
1,2,3
#3rd.csv
col1,col2,col3,col5
1,2,3,4
I am trying to combine these files on their shared columns. Where a file lacks a column, I still want
that column in the result, with the cell filled with NA.
So I expect to see:
col1,col2,col3,col4,col5
1,2,NA,NA,NA #this is 1st.csv
1,NA,2,3,NA #this is 2nd.csv
1,2,3,NA,4 #this is 3rd.csv
Here is the R code I tried, but it returns an error message:
> Combine_data <- smartbind(1st,2nd,3rd)
Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c(1001, 1001, :
replacement element 1 has 143460 rows, need 143462
Does anyone know any alternative or elegant way to get the expected result?
The R version is 3.3.2.
You should be able to accomplish this with the bind_rows function from dplyr:
df1 <- read.csv(text = "col1,col2
1,2", header = TRUE)
df2 <- read.csv(text = "col1,col3,col4
1,2,3", header = TRUE)
df3 <- read.csv(text = "col1,col2,col3,col5
1,2,3,4", header = TRUE)
library(dplyr)
res <- bind_rows(df1, df2, df3)
> res
col1 col2 col3 col4 col5
1 1 2 NA NA NA
2 1 NA 2 3 NA
3 1 2 3 NA 4
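If dplyr is not available, roughly the same result can be sketched in base R. The helper name bind_fill below is hypothetical (it is not part of any package), and the three small data frames stand in for the CSV files:

```r
# Hypothetical helper, not from the original answer: bind data frames by
# column name, filling columns that a frame lacks with NA.
bind_fill <- function(...) {
  dfs <- list(...)
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    missing_cols <- setdiff(all_cols, names(d))
    if (length(missing_cols) > 0) {
      d[missing_cols] <- NA   # add the absent columns as NA
    }
    d[all_cols]               # put the columns in a common order
  })
  do.call(rbind, filled)
}

# Stand-ins for 1st.csv, 2nd.csv and 3rd.csv
df1 <- data.frame(col1 = 1, col2 = 2)
df2 <- data.frame(col1 = 1, col3 = 2, col4 = 3)
df3 <- data.frame(col1 = 1, col2 = 2, col3 = 3, col5 = 4)

res <- bind_fill(df1, df2, df3)
print(res)
```

For real files, each data frame would come from read.csv() instead of data.frame().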
I want to merge two data frames based on a shared pattern.
The pattern is the ID name (HAND2 in the example below): ID=HAND2;ACS=20, matched as "ID=(.+);ACS"
If the ID matches in both data frames, the respective rows should be combined.
DF1
col1   col2
HAND2  H2
HAND6  H6

DF2
col1  col2
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20

MERGED (DF2 + DF1)
col1  col2             col3   col4
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20  HAND2  H2
In this example the HAND2 ID matches, so the matching rows of DF1 and DF2 are merged.
Script tried
MERGED <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = ("ID=(.+);ACS"), all = TRUE)[-1]
error
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I am struggling to find a similar command that, instead of matching on column names, matches data frame rows by a shared pattern.
Thank you in advance for your help.
You may try fuzzyjoin. In the match_fun argument you can define a function for your specific needs.
In your case, gsub extracts the ID from the col2 variable of DF2, and str_detect compares that extraction to the col1 column of DF1.
Data
DF1 <- read.table(text = "col1 col2
HAND2 H2
HAND6 H6", header = TRUE)
DF2 <- read.table(text = "col1 col2
OFS ID=GATA5;ACS=45
FAM ID=HAND2;ACS=20", header = TRUE)
Code
library(fuzzyjoin)
library(stringr)
DF2 %>%
fuzzy_left_join(DF1,
by = c("col2"= "col1"),
match_fun = function(x,y) str_detect(y, gsub("ID=(.+);(.*)", "\\1", x)) )
Output
col1.x col2.x col1.y col2.y
1 OFS ID=GATA5;ACS=45 <NA> <NA>
2 FAM ID=HAND2;ACS=20 HAND2 H2
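As an alternative without fuzzyjoin, one could extract the ID into a helper key column and then use an ordinary merge. This is only a sketch: the helper column name id and the renaming of the DF1 columns to col3 and col4 are assumptions made to mirror the MERGED layout in the question.

```r
DF1 <- data.frame(col1 = c("HAND2", "HAND6"),
                  col2 = c("H2", "H6"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(col1 = c("OFS", "FAM"),
                  col2 = c("ID=GATA5;ACS=45", "ID=HAND2;ACS=20"),
                  stringsAsFactors = FALSE)

# Pull the text between "ID=" and the first ";" into a key column
DF2$id <- sub("^ID=([^;]+);.*$", "\\1", DF2$col2)

# Copy of DF1 with the key duplicated, so the ID survives the merge as col3
DF1_keyed <- data.frame(id = DF1$col1, col3 = DF1$col1, col4 = DF1$col2,
                        stringsAsFactors = FALSE)

# Left join: keep every row of DF2, add DF1 columns where the ID matches
merged <- merge(DF2, DF1_keyed, by = "id", all.x = TRUE)
merged <- merged[c("col1", "col2", "col3", "col4")]
print(merged)
```

Unlike the str_detect approach, this matches the extracted ID exactly rather than as a substring.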
I have a data frame that looks like this
col1 <- c("test-1", "test-2","test")
col2 <- c(1,2,3)
df <- data.frame(col1,col2)
I would like to separate col1 so that my data look like this
check1 check2 col2
test 1 1
test 2 2
test NA 3
A call like this does not work:
separate(df, col1, c("check1,check2"),"-")
Any idea why?
Use fill = 'right' to fill with NA where pieces are missing and to suppress the warning:
col1 <- c("test-1", "test-2","test")
col2 <- c(1,2,3)
df <- data.frame(col1,col2)
library(tidyverse)
df %>% separate(col1, into = c('checkA', 'checkB'), sep = '-', fill = 'right')
#> checkA checkB col2
#> 1 test 1 1
#> 2 test 2 2
#> 3 test <NA> 3
Created on 2021-06-01 by the reprex package (v2.0.0)
Regarding the OP's issue: there is a syntax error. c("check1,check2") is a single string, not a vector of two column names. It should be
c("check1","check2")
separate(df, col1, c("check1","check2"),"-")
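To make the difference concrete, here is a small base R sketch. It first checks the lengths of the two vectors and then does the same split with strsplit instead of tidyr; the variable names parts and out are only illustrative.

```r
# One string containing a comma vs. two separate column names
length(c("check1,check2"))     # a single element
length(c("check1", "check2"))  # two elements

df <- data.frame(col1 = c("test-1", "test-2", "test"),
                 col2 = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Split on the literal "-" and take the first and second piece of each row;
# indexing past the end of a vector yields NA, which pads the short rows
parts  <- strsplit(df$col1, "-", fixed = TRUE)
check1 <- vapply(parts, function(p) p[1], character(1))
check2 <- vapply(parts, function(p) p[2], character(1))

out <- data.frame(check1, check2, col2 = df$col2, stringsAsFactors = FALSE)
print(out)
```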
Dear all,
Summarizing my problem in a small example...
I want to append a row to a data.frame using variables that have the same names as the data.frame columns, like this:
# creating a blank data.frame
df <- data.frame(matrix(ncol=3, nrow=0))
#naming the header
head <- c("col1", "col2", "col3")
# assigning header to data.frame
colnames(df) <- head
# creating three variables with the same name of header
col1 <- 1
col2 <- 2
col3 <- 3
#appending the row
rbind(df, list(col1, col2, col3))
The code runs, but df remains empty. I would like df to look like this:
col1 col2 col3
1 2 3
Help me with this rbind.
rbind() returns a new data frame rather than modifying df in place, so assign its result. You can then use the names() function to set the column names:
# creating a blank data.frame
df <- data.frame(matrix(ncol=3, nrow=0))
#naming the header
head <- c("col1", "col2", "col3")
# assigning header to data.frame
colnames(df) <- head
# creating three variables with the same name of header
col1 <- 1
col2 <- 2
col3 <- 3
#appending the row
df2 <- rbind(df, list(col1, col2, col3))
names(df2) <- c("col1", "col2", "col3")
df2
produces the output below
col1 col2 col3
1 2 3
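Another option, sketched here as an assumption rather than as part of the answer above, is to bind a one-row data frame whose columns are already named, which makes the separate names() step unnecessary. The variable is called header instead of head so that it does not mask the base function head().

```r
# Blank data frame with three named columns
df <- data.frame(matrix(ncol = 3, nrow = 0))
header <- c("col1", "col2", "col3")
colnames(df) <- header

col1 <- 1
col2 <- 2
col3 <- 3

# Build the new row with explicit names, then assign the result of rbind
new_row <- data.frame(col1 = col1, col2 = col2, col3 = col3)
df <- rbind(df, new_row)
print(df)
```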
Below I have two columns of data (columns 6 and 7) holding genus and species names. I would like to combine these two character columns into a new column containing the combined names.
I am quite new to R and the code below does not work. Thank you for the help, wonderful people of Stack Overflow!
#TRYING TO MIX GENUS & SPECIES COLUMN
accepted_genus <- merged_subsets_2[6]
accepted_species <- merged_subsets_2[7]
accepted_genus
accepted_species
merged_subsets_2%>%
bind_cols(accepted_genus, accepted_species)
merged_subsets_2
We can use str_c from stringr
library(dplyr)
library(stringr)
df %>%
mutate(Col3 = str_c(Col1, Col2))
Or with unite
library(tidyr)
df %>%
unite(Col3, Col1, Col2, sep="", remove = FALSE)
Please take a look at this if this doesn't answer your question.
df <- data.frame(Col1 = letters[1:2], Col2=LETTERS[1:2]) # Sample data
> df
Col1 Col2
1 a A
2 b B
df$Col3 <- paste0(df$Col1, df$Col2) # Without spacing
> df
Col1 Col2 Col3
1 a A aA
2 b B bB
df$Col3 <- paste(df$Col1, df$Col2) # With a space (the default separator)
> df
Col1 Col2 Col3
1 a A a A
2 b B b B
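Applied to the genus and species case from the question, a sketch could look like this. The data frame taxa and its values are made up as a stand-in for merged_subsets_2, and the columns sit at positions 1 and 2 here rather than 6 and 7.

```r
# Made-up stand-in for merged_subsets_2 with only the two relevant columns
taxa <- data.frame(
  accepted_genus   = c("Quercus", "Acer"),
  accepted_species = c("robur", "rubrum"),
  stringsAsFactors = FALSE
)

# Select the columns by position with [[ ]] and join them with a space
taxa$accepted_name <- paste(taxa[[1]], taxa[[2]])
print(taxa)
```

With the original data, the positions would presumably be 6 and 7, as described in the question.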
I have a data frame that I have to initialize as an empty data frame.
Later I have only one column available, and I want to add it to the empty data frame. How can I do this? I do not know the length of the column in advance.
Example
df = data.frame(a= NA, b = NA, col1= NA)
....
length(col1) # Here I will know the length of the column, and I only have one column available here.
df$col1 <- col1
error is as follows:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c("1", :
replacement has 5 rows, data has 1
Any help would be greatly appreciated.
use cbind
df = data.frame(a= NA, b = NA)
col1 <- c(1,2,3,4,5)
df <- cbind(df, col1)
# a b col1
# 1 NA NA 1
# 2 NA NA 2
# 3 NA NA 3
# 4 NA NA 4
# 5 NA NA 5
After your edits, you can still use cbind, but you'll need to drop the existing column first (or handle the duplicate columns after the cbind)
cbind(df[, 1:2], col1)
## or if you don't know the column indices
## cbind(df[, !names(df) %in% c("col1")], col1)
A little workaround with lists:
l <- list(a=NA, b=NA, col1=NA)
col1 <- c(1,2,3)
l$col1 <- col1
df <- as.data.frame(l)
I like both answers provided by Symbolix and maRtin. I also came up with my own hack, as follows.
df[1:length(a),"a"] = a
However, I am not sure how efficient this method is in terms of time. What would its big-O time complexity be?
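For comparison, here is a sketch that puts the cbind approach, the list approach, and a third option (building the data frame only once the column length is known) side by side. It is reasonable to expect each of them to take time roughly linear in the column length, since the column has to be copied at least once, but that is an assumption; measuring with system.time() on realistic data is the reliable way to compare.

```r
col1 <- c(1, 2, 3, 4, 5)
n <- length(col1)

# Approach A: cbind onto a one-row frame (the single NA row is recycled)
df_a <- cbind(data.frame(a = NA, b = NA), col1)

# Approach B: fill a list first, then convert it
l <- list(a = NA, b = NA, col1 = NA)
l$col1 <- col1
df_b <- as.data.frame(l)

# Approach C: create the frame only once the length is known
df_c <- data.frame(a = rep(NA, n), b = rep(NA, n), col1 = col1)

print(df_c)
```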