Trying to combine two data frames but getting an error - r

So I have two data frames that I am attempting to combine. The two DFs share the following characteristics:
They have the same number of columns
Column names are the same
The same number of rows (100 each)
The first column in each table is ID. One table has IDs 1 through 100 while the other has IDs 101 through 200.
I have attempted to use the rbind function but it will throw an error that I can't figure out how to get around:
data3 <- rbind(data1,data2)
The error reads:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
Does anyone have any advice on how to work around this?
Ultimately, I want them combined into one single data frame of IDs 1 through 200 with all the corresponding data in the columns.
So say data1 looks like:
ID Team Position
1 Pirates Pitcher
2 Yankees Catcher
3 Red Sox Outfield
And data2 looks like:
ID Team Position
4 Astros Pitcher
5 Brewers First
6 Dodgers Shortstop
I want the final result (data3) to look like:
ID Team Position
1 Pirates Pitcher
2 Yankees Catcher
3 Red Sox Outfield
4 Astros Pitcher
5 Brewers First
6 Dodgers Shortstop
By the way, these are not the names or data I'm working with. Just more of a simplified example.

Depending on whether you want to preserve the row names or not, you can do one of the following:
# Do not preserve row names
rownames(data1) <- NULL
rownames(data2) <- NULL
rbind(data1, data2)
# Preserve the row names in a new column; the column must get the same
# name in both data frames, or rbind() will complain about mismatched names
data1 <- cbind(row_name = rownames(data1), data1)
data2 <- cbind(row_name = rownames(data2), data2)
rownames(data1) <- NULL
rownames(data2) <- NULL
rbind(data1, data2)
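For reference, here is a minimal sketch using the toy data from the question (the names are illustrative only), showing that rbind() stacks the rows once the row names no longer clash:
data1 <- data.frame(ID = 1:3,
                    Team = c("Pirates", "Yankees", "Red Sox"),
                    Position = c("Pitcher", "Catcher", "Outfield"))
data2 <- data.frame(ID = 4:6,
                    Team = c("Astros", "Brewers", "Dodgers"),
                    Position = c("Pitcher", "First", "Shortstop"))
rownames(data1) <- NULL
rownames(data2) <- NULL
data3 <- rbind(data1, data2)  # one data frame with IDs 1 through 6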

This is quite easy and takes just three lines of code. Note that merge() joins tables side by side on a key column; to stack the rows as the question asks, use rbind():
File1 = read.table("C:\\Users\\your_path\\Desktop\\File1.txt")
File2 = read.table("C:\\Users\\your_path\\Desktop\\File2.txt")
All = rbind(File1, File2)

Related

Concatenate columns in data frame

We have brand data in a column/variable that is delimited by semicolons (;). Our task is to split this column into multiple columns, which we were able to do with the following syntax.
[The data set is attached as a screenshot.]
Here is the R code:
point <- dataset %>%
  separate(Pref_All, c("Pref_01", "Pref_02", "Pref_03", "Pref_04", "Pref_05"), ";")
point[is.na(point)] <- ""
However, our question is this: we have this type of brand data in 10 to 15 columns, and with the above syntax the number of columns to split into has to be fixed in advance from the maximum number of brands any column holds (which we manually calculated and took as 5 columns).
We would like to know whether there is a way to write the code dynamically, so that it calculates the maximum number of brands each column holds and creates that many new columns in the data frame, e.g.
Pref_01, Pref_02, Pref_03, Pref_04, Pref_05.
[The preferred output is attached as a screenshot.]
Thanks for the help in advance.
x <- c("Swift;Baleno;Ciaz;Scross;Brezza", "Baleno;swift;celerio;ignis", "Scross;Baleno;celerio;brezza", "", "Ciaz;Scross;Brezza")
strsplit(x,";")
library(dplyr)
library(tidyr)
x <- data.frame(ID = c(1, 2, 3, 4, 5),
                Pref_All = c("S;B;C;S;B",
                             "B;S;C;I",
                             "S;B;C;B",
                             " ",
                             "C;S;B"))
# Make sure the column is character, not factor (relevant on R < 4.0)
x$Pref_All <- as.character(x$Pref_All)
# Number of brands in each row; max(b) drives how many columns to create
b <- sapply(strsplit(x$Pref_All, ";"), length)
final_df <- x %>%
  tidyr::separate(Pref_All, paste0("Pref_0", 1:max(b)), ";")
# Put the original, unsplit column back alongside the new ones
final_df$ID <- x$Pref_All
final_df <- rename(final_df, Pref_All = ID)
final_df[is.na(final_df)] <- ""
Pref_All Pref_01 Pref_02 Pref_03 Pref_04 Pref_05
1 S;B;C;S;B S B C S B
2 B;S;C;I B S C I
3 S;B;C;B S B C B
4
5 C;S;B C S B
The trick for the column names is given by paste0 going from 1 to the maximum number of brands in your data!
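For instance, here is what that naming call produces on its own (a quick sketch):
paste0("Pref_0", 1:5)
# [1] "Pref_01" "Pref_02" "Pref_03" "Pref_04" "Pref_05"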
I would use str_split(), which returns a list of character vectors. From that, we can work out the max number of preferences in the dataframe and then apply a function over it to pad in the missing elements.
df=data.frame("id"=1:5,
"Pref_All"=c("brand1", "brand1;brand2;brand3", "", "brand2;brand4", "brand5"))
spl = str_split(df$Pref_All, ";")
# Find the max number of preferences
maxl = max(unlist(lapply(spl, length)))
# Add missing values to each element of the list
spl = lapply(spl, function(x){c(x, rep("", maxl-length(x)))})
# Bind each element of the list in a data.frame
dfr = data.frame(do.call(rbind, spl))
# Rename the columns
names(dfr) = paste0("Pref_", 1:maxl)
print(dfr)
# Pref_1 Pref_2 Pref_3
#1 brand1
#2 brand1 brand2 brand3
#3
#4 brand2 brand4
#5 brand5
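For what it's worth, tidyr::separate() can also handle the padding itself via its fill argument, so a compact variant of the above (a sketch, assuming the same df) would be:
library(tidyr)
n_max <- max(lengths(strsplit(df$Pref_All, ";")))
df_wide <- separate(df, Pref_All,
                    into = paste0("Pref_", seq_len(n_max)),
                    sep = ";", fill = "right", remove = FALSE)
Here the short rows come back as NA rather than "", which you can blank out as in the answers above.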

R: Populating a data frame with multiple matches for a single value without looping

I have a working solution to this problem using a while-loop. I have been made aware that it is typically bad practice to use loops in R, so I was wondering about alternative approaches.
I have two dataframes, one single-column df full of gene names:
head(genes)
Genes
1 C1QA
2 C1QB
3 C1QC
4 CSF1R
5 CTSC
6 CTSS
And a two-column df that has pairs of the gene name (HGNC.symbol) and accompanying ensembl ID (Gene.stable.ID) for each transcript of the given gene:
head(ensembl_key)
Gene.stable.ID HGNC.symbol
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000209082 MT-TL1
6 ENSG00000198888 MT-ND1
My goal is to create a df that, for each gene in the genes df, extracts all corresponding transcript IDs (Gene.stable.ID) from the ensembl_key df.
The reason I have only found a looping solution is that a single entry in genes may have multiple matches in ensembl_key. I need to retain all matches in the final df, and I do not know a priori how many matches a single ID from genes has.
Here is my current working solution:
# Create a large empty df to hold all transcripts
gene_transcript <- data.frame(matrix(NA, nrow = 5000, ncol = 2))
colnames(gene_transcript) <- c("geneID", "ensemblID")
# Populate the columns
curr_gene <- 1
gene_count <- 1
while (gene_count <= dim(genes)[1]) {
  transcripts <- ensembl_key[which(ensembl_key$HGNC.symbol == genes$Genes[gene_count]), 1]
  if (length(transcripts) > 1) {
    num <- length(transcripts) - 1
    gene_transcript$geneID[curr_gene:(curr_gene + num)] <- genes$Genes[gene_count]
    gene_transcript$ensemblID[curr_gene:(curr_gene + num)] <- transcripts
    gene_count <- gene_count + 1
    curr_gene <- curr_gene + num + 1
  }
  else {
    gene_transcript$geneID[curr_gene] <- genes$Genes[gene_count]
    gene_transcript$ensemblID[curr_gene] <- transcripts
    gene_count <- gene_count + 1
    curr_gene <- curr_gene + 1
  }
}
# Drop the unused rows at the bottom
last_row <- which(is.na(gene_transcript$geneID))[1] - 1
gene_transcript <- gene_transcript[1:last_row, ]
Any help is greatly appreciated, thanks!
It sounds like you want to join or merge. There are several ways to do this, but the following should work.
merge(genes,
ensembl_key,
by.x = "Genes",
by.y = "HGNC.symbol")

merging multiple dataframes with duplicate rows in R

Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list) {
  if (!exists("MergedData")) {
    MergedData <- read.csv(file, skip = 5)[, c("id", "name", "value")]
    colnames(MergedData) <- c("id", "name", file)
  }
  else if (exists("MergedData")) {
    temp_data <- read.csv(file, skip = 5)[, c("id", "name", "value")]
    colnames(temp_data) <- c("id", "name", file)
    MergedData <- merge(MergedData, temp_data, by = c("id", "name"), all = TRUE)
    rm(temp_data)
  }
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
  dat = read_csv(file)
  dat$source.file = file
  return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var = "value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with the "Aggregation function missing: defaulting to length" warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id = rep(1:2, 2),
                name = rep(c("A", "B"), 2),
                source.file = rep(c("001", "002"), each = 2),
                value = 11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
  mutate(counter = 1:n())
As you can see, the counter value only has a value of 1 in cases where there's only one combination of id, name, and source.file, but has values of 1 and 2 for one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
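Alternatively, since the question only wants the first value and wants to ignore further duplicates, you could drop the duplicate rows before casting. A sketch using dplyr::distinct(), which keeps the first occurrence of each combination:
df = df %>% distinct(id, name, source.file, .keep_all = TRUE)
dcast(df, id + name ~ source.file, value.var = "value")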
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).

Alternative to using table in R?

I have a function called notes_count(id) that takes a vector as a parameter (for example, it can accept arguments such as 5, c(1,2,3), 6:20, or 5:1) and returns the ID and "count" of the notes. I have a data frame with the following columns:
"ID" "Date" "Notes"
that contains an unknown number of entries per "ID", for example:
ID Date Notes
1 xxx "This is a note"
1 xxx "More notes here"
...
8 xxx "Hello World"
The problem I am running into is that I want the output to be ordered in the same way as the input vector, meaning notes_count(3:1) should list the results in reverse order as a data frame:
ID notes_count
1 3 6
2 2 288
3 1 102
and calling notes_count(1:3) would result in:
ID notes_count
1 1 102
2 2 288
3 3 6
However, table always reorders from min to max regardless of the order it is given. Is there a way to do what table does directly on the data frame, using other functions, so that I can control the output?
Currently my code is this:
#Before calling table I have data frame "notes" in the order I want but table reorders it
notes_count <- as.data.frame(table(notes[["ID"]]))
which seems silly, since it makes the original data frame a table and then converts it back.
EDIT:
Here is my code as basic as it is as requested
notes_count <- function(id){
  ## notes.csv format
  ## "ID","Date","Notes"
  ## 1,"2016-01-01","Some notes"
  # Read the csv into a data frame
  notes <- read.csv("notes.csv")
  # Remove all NA values
  notes <- notes[complete.cases(notes), ]
  # You can order the data here, but it won't matter once table() aggregates
  # the notes to a "count" on the next line
  notes <- notes[id, ]
  # Convert the table back to a data frame
  notes_count <- as.data.frame(table(notes[["ID"]]))
  notes_count
}
Here's a simplified example that should get you going:
set.seed(1234)
notes <- data.frame(id = sample(2:10, size = 100, replace = TRUE), Note = "Some note")
notes_count <- function(id) {
  # Indexing the table by as.character(id) returns the counts
  # in the same order as the input vector
  counts <- table(notes[notes$id %in% id, ])
  return(data.frame(count = counts[as.character(id), ]))
}
notes_count(c(10,2,5))
# Results
count
10 8
2 12
5 2
If I understand correctly, you want to sort the data frame by the notes_count variable?
Then use the order() function to reshuffle the df rows:
your_data_frame[order(your_data_frame$notes_count,decreasing=TRUE),]
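Another way to keep the input order (a sketch, assuming a notes data frame with an id column as in the example above, and unique values in id) is to turn the ids into a factor whose levels are in the requested order, since table() reports counts in factor-level order:
notes_count <- function(id) {
  counts <- table(factor(notes$id, levels = id))
  data.frame(ID = id, notes_count = as.vector(counts))
}
notes_count(3:1)  # rows come back in the order 3, 2, 1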

Creating a count matrix from factor level occurrences in a list of dataframes

Since I cannot give example data, here are two small text files representing the first 5 lines of two of my input files:
https://www.dropbox.com/sh/s0rmi2zotb3dx3o/AAAq0G3LbOokfN8MrYf7jLofa?dl=0
I read all text files in the working directory into a list, cut some columns, set new names, and subset by a numerical cutoff in the third column:
all.files <- list.files(pattern = "\\.txt$")
data.list <- lapply(all.files, function(x) read.table(x, sep = "\t"))
names(data.list) <- all.files
data.list <- lapply(data.list, function(x) x[, 1:3])
new.names <- c("query", "sbjct", "ident")
data.list <- lapply(data.list, setNames, new.names)
new.list <- lapply(data.list, function(x) subset(x, ident > 99))
I am ending up with a list of dataframes, which consist of three columns each.
Now, I want to
count the occurrences of factors in the column "sbjct" in all dataframes in the list, and
build a matrix from the counts, in which rows = factor levels of "sbjct" and columns = occurrences in each dataframe.
For each dataframe in the list, a new object with two columns (sbjct/counts) should be created, named according to the original dataframe in the original list. In the end, all the new objects should be merged with cbind (for example), and empty cells (data absent) should be filled with zeros, resulting in a "sbjct x counts" matrix.
For example, if I had a single dataframe, dplyr would help me like this:
library(dplyr)
some.object <- some.dataframe %>%
group_by(sbjct) %>%
summarise(counts = length(sbjct))
>some.object
Source: local data frame [5 x 2]
sbjct counts
1 AB619702.1.1454 1
2 EU287121.1.1497 1
3 HM062118.1.1478 1
4 KC437137.1.1283 1
5 Yq2He155 1
But it seems it cannot be applied to lists of dataframes.
Add a column to each data set which acts as an indicator [let's name it Ndata] that the particular observation comes from that dataset. Now rbind all these data sets.
When you then make a cross table of sbjct x Ndata, you'll get the matrix you are looking for.
Here is some code to clarify:
t <- c("a", "b", "c", "d", "e", "f")
set.seed(10)
d1 <- data.frame(sbjt = sample(t, sample(20, 1), replace = TRUE))
d2 <- data.frame(sbjt = sample(t, sample(20, 1), replace = TRUE))
d3 <- data.frame(sbjt = sample(t, sample(20, 1), replace = TRUE))
d1$Ndata <- rep("d1", nrow(d1))
d2$Ndata <- rep("d2", nrow(d2))
d3$Ndata <- rep("d3", nrow(d3))
all <- rbind(d1, d2, d3)
ct <- table(all$sbjt, all$Ndata)
ct looks like this:
> ct
d1 d2 d3
a 1 0 0
b 4 0 1
c 2 2 1
d 3 1 0
e 1 0 0
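Applied to the list of data frames from the question, dplyr::bind_rows() can build the indicator column for you from the list names (a sketch, assuming new.list as constructed above):
library(dplyr)
all <- bind_rows(new.list, .id = "Ndata")
ct <- table(all$sbjct, all$Ndata)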
