I know this is a newbie question. I have three xlsx files, each holding a data set with the same 14 variables; together they form a cross-sectional data panel.
All I want is to concatenate them into one single data set called eplt.
First, I import them:
library(dplyr)
library(ggplot2)
library(xlsx)
##Import the three data bases
epl_data<-read.xlsx("Notes_ETAB2016-2017.xlsx",sheetIndex = 1,header = TRUE)
epl_data2<-read.xlsx("Notes_ETAB2017-2018.xlsx",sheetIndex = 1,header = TRUE)
epl_data3<-read.xlsx("Notes_ETAB2018-2019.xlsx",sheetIndex = 1,header = TRUE)
## show the number of rows in each of them
nrow(epl_data)
nrow(epl_data2)
nrow(epl_data3)
# I want to rbind the three sets together
eplt<-rbind(epl_data,epl_data2,epl_data3)
The total number of rows is 29441, but when I apply rbind to bind them all together I get this error:
> eplt<-rbind(epl_data,epl_data2,epl_data3)
Error in match.names(clabs, names(xi)) :
names do not match previous names
But the names of the variables in the three sets are the same.
Could someone please help? I only want to bind 25000 observations and keep the remaining 4441 to compare against the predictions of a multiple regression model.
Thanks in advance.
The third data frame doesn't have the same names as the first two: Svt isn't in upper case.
One way is to apply the names of one data frame to the others (this assumes the columns are in the same order in all three):
colnames(epl_data2) <- colnames(epl_data)
colnames(epl_data3) <- colnames(epl_data)
But I recommend the janitor package whenever your data come from Excel files, where variable name issues are common. This package ensures consistent formatting of your column names:
epl_data <- janitor::clean_names(epl_data)
epl_data2 <- janitor::clean_names(epl_data2)
epl_data3 <- janitor::clean_names(epl_data3)
After that, the rbind should work.
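To find out which names actually differ before renaming anything, a quick check with setdiff() can help. This is only a minimal sketch on toy data frames: the columns and values below are made up, and toupper() is used here as a simple stand-in for the renaming approaches above.

```r
# Toy stand-ins for the three imported sheets (hypothetical columns and values)
epl_data  <- data.frame(ID = 1:2, SVT = c(12, 15))
epl_data2 <- data.frame(ID = 3:4, SVT = c(9, 14))
epl_data3 <- data.frame(ID = 5:6, Svt = c(11, 8))   # same variable, different case

# Names present in the third data frame but missing from the first
mismatched <- setdiff(names(epl_data3), names(epl_data))
print(mismatched)

# Harmonise the case, then bind the rows
names(epl_data3) <- toupper(names(epl_data3))
eplt <- rbind(epl_data, epl_data2, epl_data3)
```

If setdiff() returns an empty character vector in both directions, the names match and the problem lies elsewhere.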
As already mentioned you have a mismatch in the variable name 'SVT'. Here is an alternative that would make the column names lower case and bind them together in one dataframe.
library(dplyr)
library(purrr)
# match all three files (2016-2017, 2017-2018, 2018-2019), not only the 2016 one
eplt <- list.files(pattern = 'Notes_ETAB\\d+-\\d+\\.xlsx') %>%
  map_df(~readxl::read_excel(.x) %>% rename_with(~tolower(.)))
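The question also asks about keeping 25000 rows for the model and setting the remaining 4441 aside. One possible way, sketched here on a toy data frame, is a random split. The random sampling, the seed, and the names n_train, train and holdout are assumptions for illustration; the question does not say how the 25000 rows should be chosen.

```r
# Toy stand-in for the combined data (hypothetical); the real eplt has 29441 rows
eplt <- data.frame(id = 1:10, note = seq(10, 19))

n_train <- 7          # would be 25000 in the question
set.seed(1)           # makes the random split reproducible

# Draw row indices for the estimation sample; the rest becomes the holdout
train_idx <- sample(nrow(eplt), n_train)
train     <- eplt[train_idx, ]
holdout   <- eplt[-train_idx, ]
```

The model would then be fitted on train, and its predictions compared with the observed values in holdout.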
I am looking to transpose my data set.
Context: QIIME2 outputs its table with 'Feature.ID' (essentially bacterial species) as rows and sites as columns, with abundances as the cell values. Many R packages, such as Biodiversity, require sites as rows and species as columns, so I am looking to transpose my data.
library(tidyverse)
Ftable <- read_tsv('table.feature-table_biom.txt', col_names = TRUE, skip = 1) #opens my QIIME2 file, removing the top line
names(Ftable)[names(Ftable)=="#OTU ID"] <- "Feature.ID" #Renames the species column
Ftable <- cbind(Ftable, "observation"=1:nrow(Ftable)) #adds an indexing column 'observation' so that I can later remove the column containing the species names as they are complex and I do not wish to type them out
Ftable <- Ftable %>% select(observation, everything()) #Moves observation to the front
OtuOb <- Ftable
OtuOb <- as_tibble(OtuOb)
write_tsv(OtuOb, "OtuToObservationReference.tsv") #These 3 lines save a reference so I can look to see which observations align to which species
Ftable <- Ftable[,-2] #removes 'Feature.ID' column so that the observation column shows the species
Ftable <- t(Ftable) #I have been trying to get this to work but it doesn't
Ftable <- as_tibble(Ftable)
write_tsv(Ftable, "Ftable.tsv")
When transposing, it removes the sample references (along the top in the original), which means I have no way of seeing how they match up to the abundances of each species.
Small sample of my data
Before transposing, assign your row names to an object, do the transpose, then assign the stored row names to the new column names:
Ftable_rownames <- rownames(Ftable)
# transpose
Ftable <- t(Ftable)
# assign names from the stored object
colnames(Ftable) <- Ftable_rownames
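Since tibbles do not keep row names, the sample references are likely lost at the conversion step after t(). A minimal sketch of one way around that, in base R on a toy table: the feature IDs, site names and values are made up, and the column name Sample is an assumption.

```r
# Toy stand-in for the QIIME2 feature table (hypothetical IDs and abundances)
Ftable <- data.frame(
  Feature.ID = c("otu_a", "otu_b", "otu_c"),
  Site1 = c(10, 0, 3),
  Site2 = c(5, 2, 0)
)

# Keep only the abundances as a matrix and label its rows with the feature IDs
abund <- as.matrix(Ftable[, -1, drop = FALSE])
rownames(abund) <- Ftable$Feature.ID

# Transpose: rows are now sites, columns are features
transposed <- t(abund)

# Store the site names in an ordinary column so they survive later conversions
Ftable_t <- data.frame(Sample = rownames(transposed), transposed,
                       row.names = NULL, check.names = FALSE)
```

Because the site names now live in a regular column, writing Ftable_t to a tsv file or converting it to a tibble keeps them.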
I would like to assign names to rows in R, but so far I have only found ways to assign names to columns. My data is in two columns: the first column (geo) holds the name of the specific location I'm investigating, and the second column (skada) holds the observed value at that location. To clarify, I want to be able to refer to every location by name instead of just having them all in one .txt file, so that the data is easier to work with. Does anyone with more experience know how to handle this in R?
First you need to import the data into your global environment. Try the function read.table().
To name rows, try
(assuming your data.frame is named df):
rownames(df) <- df[, "geo"]
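df <- df[, -1, drop = FALSE] # drop = FALSE keeps df a data frame (and its row names) when only one column remains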
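A small worked example of the idea, on made-up data (the location names and values are hypothetical). Note that row names must be unique, so this only works if each location appears once.

```r
# Toy data in the shape described in the question (hypothetical values)
df <- data.frame(geo = c("north", "south", "east"),
                 skada = c(3.2, 1.5, 4.8))

# Use the location names as row names, then drop the geo column
rownames(df) <- df[, "geo"]
df <- df[, -1, drop = FALSE]

# Values can now be looked up by location name
north_value <- df["north", "skada"]
```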
Well, your question is not that clear...
I assume you are trying to create a data.frame with named rows. If you look at the data.frame help, you can see the description of the row.names parameter:
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
which means you can either specify the row names manually when you create the data.frame, or name the column that contains them. The former can be achieved as follows
d = data.frame(x=rnorm(10),            # 10 random data normally distributed
               y=rnorm(10),            # 10 random data normally distributed
               row.names=letters[1:10] # take the first 10 letters and use them as row header
               )
while the latter is
d = data.frame(x=rnorm(10),      # 10 random data normally distributed
               y=rnorm(10),      # 10 random data normally distributed
               r=letters[1:10],  # take the first 10 letters
               row.names=3       # the column with the row headers is the 3rd
               )
If you are reading the data from a file, I will assume you are using the command read.table. Many of its parameters are the same as those of data.frame; in particular, you will find that the row.names parameter works the same way:
a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.
Finally, if you have already read the data.frame and you want to change the row names, Pierre's answer is your solution
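For the read.table route, a minimal sketch that reads from an inline string instead of a file, so it runs as is. The values are made up; with a real file you would pass its path as the first argument instead of text =.

```r
# Inline stand-in for the .txt file (hypothetical values)
raw <- "geo skada
north 3.2
south 1.5
east 4.8"

# row.names = "geo" tells read.table which column holds the row names
d <- read.table(text = raw, header = TRUE, row.names = "geo")

# The geo column is consumed as row names; only skada remains as data
print(d)
```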
I have a large CSV file, say INPUT, with about 500+ columns. I also have a dataframe DF that contains a subset of the column headers of INPUT which changes at every iteration.
I have to fetch the data from only those columns of INPUT that are present in the dataframe DF and write it into another CSV file, say OUTPUT.
In short,
INPUT.csv:
ID,Col_A,Col_B,Col_C,Col_D,Col_E,Col_F,,,,,,,,,,,,,Col_S,,,,,,,,,,,,,,,,Col_Z
1,009,abcd,67,xvz,33,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,oup,,,,,,,,,,,,,,,,,,90
2,007,efgh,87,wuy,56,67,,,,,,,,,,,,,,,,,,,,,,,,,,,,ghj,,,,,,,,,,,,,,,,,,,,888
print(DF):
[1] "Col_D" "Col_Z"
[3] "Col_F" "Col_S"
OUTPUT.csv
ID,Col_D,Col_Z,Col_F,Col_S
1,xvz,90,50,oup
2,wuy,888,67,ghj
I'm a beginner when it comes to R. I would prefer the matching of the dataframe with the INPUT file to be automated, because I don't want to do this by hand every day when the dataframe gets updated.
I'm not sure whether this is the answer:
input <- read.table(...)
input[colnames(input) %in% colnames(DF)]
If I understand correctly, you need to import the INPUT.csv file into R and then match the columns of your DF against the columns of INPUT. Is that correct?
You can either use the match function or import the INPUT.csv file in RStudio via the "Import Dataset" button and subset it. Subsetting imported data frames is fairly easy.
If you import your dataset as INPUT, you can subset these columns in the following way: INPUT[,c(1,2,4)]
That will get you the first, second and fourth columns of the INPUT dataset.
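The match approach mentioned above could look roughly like this. It is a sketch on a cut-down toy version of INPUT, and it assumes DF is a character vector of column names, as the print(DF) output in the question suggests.

```r
# Cut-down toy version of INPUT (only a few of the 500+ columns)
INPUT <- data.frame(ID = 1:2,
                    Col_A = c("009", "007"),
                    Col_D = c("xvz", "wuy"),
                    Col_F = c(50, 67))

# Wanted column names (assumed to be a character vector)
DF <- c("Col_D", "Col_F")

# match() returns the position of each wanted name in names(INPUT)
idx <- match(DF, names(INPUT))

# Keep the ID column (position 1) plus the matched columns
subset_input <- INPUT[, c(1, idx)]
```

match() returns NA for any name in DF that is not a column of INPUT, so it is worth checking anyNA(idx) before subsetting.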
First, reading the CSV is simple:
dataframe_read <- read.csv('/Path/to/csv/')
If I understand correctly that one data frame's columns are always a subset of the other's, the code is as follows:
### Example Dataframes
df1 <- data.frame(one = c(1,3,4), two = c(1,2,3), three = c(1,2,3))
df2 <- data.frame(one = c(1,3,4), three = c(1,2,3))
### Make new data frame
df3 <- df1[,colnames(df2)]
### Write new dataframe
write.csv(df3, 'hello.csv')
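Putting the pieces together for the INPUT/DF/OUTPUT case from the question, here is a sketch of the whole round trip. It uses temporary files and a shortened toy INPUT so it runs as is; the paths are placeholders, and DF is again assumed to be a character vector of column names.

```r
# Temporary files stand in for INPUT.csv and OUTPUT.csv (placeholder paths)
input_path  <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")

# Shortened toy INPUT with the sample values from the question
writeLines(c("ID,Col_A,Col_D,Col_F,Col_S,Col_Z",
             "1,009,xvz,50,oup,90",
             "2,007,wuy,67,ghj,888"), input_path)

# Wanted columns, in the order they should appear in OUTPUT
DF <- c("Col_D", "Col_Z", "Col_F", "Col_S")

INPUT <- read.csv(input_path, stringsAsFactors = FALSE)

# intersect() silently skips names in DF that INPUT does not have
keep   <- c("ID", intersect(DF, names(INPUT)))
OUTPUT <- INPUT[, keep, drop = FALSE]

# row.names = FALSE avoids an extra unnamed index column in the file
write.csv(OUTPUT, output_path, row.names = FALSE)
```

Since the column selection is driven entirely by DF, rerunning the script after DF changes picks up the new set of columns without manual edits.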