R: row.names and data manipulation / export

I am having some trouble understanding what row.names is, how it works, and how I can get my data to do the things row.names allows.
For example, I am creating some clusters with the code below (my data). I want to export the results, which is what the sapply line does, though only to the screen for now. The first column of my data frame (path_country) holds country names and the other columns are other variables (integers). I don't see an easy way to export these clusters to a table or list of countries and their group membership.
I tried to make a dummy example using the example data sets in R, e.g. mtcars, and it was then that I noticed the first column was denoted as row.names. With mtcars I can create clusters, cutree to the specified number of groups, and then save the result as a data frame. With this approach I get the car names in the first column and the group number in the second column (more or less; it could be cleaned up to look nicer, but it is essentially what I am after), which is what I would like to happen with my data.
Any thoughts on this would be appreciated.
# my data
path_country <- read.csv("C:/path_country.csv")
patho <- subset(path_country, select=c(2:188))
patho.d <- dist(patho)
patho.hclust <- hclust(patho.d)
patho.hclust.groups11 = cutree(patho.hclust,11)
sapply(unique(patho.hclust.groups11), function(g) path_country$Country[patho.hclust.groups11 == g])
# mtcars data
car.d <- dist(mtcars)
car.h <- hclust(car.d)
car.h.11 <- cutree(car.h, 11)
nice_result <- as.data.frame(car.h.11)
write.table(nice_result, "test.txt", sep="\t")
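(As an aside, one way to tidy that mtcars result into the two explicit columns described above could be the following sketch; 'car' and 'group' are just illustrative column names.)
# sketch: turn the row names into an explicit 'car' column
nice_result <- data.frame(car = rownames(nice_result),
                          group = nice_result$car.h.11,
                          row.names = NULL)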

1) You can create a data.frame with row.names directly from the CSV file:
# Names in the first column
path_country <- read.table("C:/path_country.csv", sep=",", header=TRUE, row.names=1)
# Names in column "Country"
path_country <- read.table("C:/path_country.csv", sep=",", header=TRUE, row.names="Country")
Note that sep="," is needed because the file is comma-separated, and header=TRUE so that the column names (such as "Country") are read; read.csv() sets both of these by default.
Now rownames(path_country) gives you a vector of row names, and as.data.frame(patho.hclust.groups11) a nice result for export.
2) At any time you can set the row names of an existing data.frame with the command:
rownames(path_country) <- names.vector
where names.vector is a vector of unique names whose length equals the number of rows in the data.frame. In your example the result of cutree() is a plain vector rather than a data.frame, so use names() instead:
names(patho.hclust.groups11) <- path_country$Country
Note that if you use the first approach you don't need this command, because cutree() picks the names up from the row names automatically.
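Putting points 1) and 2) together with the clustering code from the question, a minimal sketch (assuming the CSV's "Country" column holds unique names and the remaining columns are numeric):
# read the file so the country names become row names
path_country <- read.csv("C:/path_country.csv", row.names = "Country")
patho.d <- dist(path_country)                      # the row names travel with the distance matrix
patho.hclust <- hclust(patho.d)
patho.hclust.groups11 <- cutree(patho.hclust, 11)  # returns a vector named by country
# the country names end up as row names of the exported table
nice_result <- as.data.frame(patho.hclust.groups11)
write.table(nice_result, "test.txt", sep = "\t")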

Related

rbind three data frames using the rbind function

I know it's a newbie question. I have three xlsx files containing three data frames with the same 14 variables; it's a cross-section data panel.
All I want is to concatenate them into one single data frame called eplt.
First, I import them:
library(dplyr)
library(ggplot2)
library(xlsx)
## Import the three data frames
epl_data<-read.xlsx("Notes_ETAB2016-2017.xlsx",sheetIndex = 1,header = TRUE)
epl_data2<-read.xlsx("Notes_ETAB2017-2018.xlsx",sheetIndex = 1,header = TRUE)
epl_data3<-read.xlsx("Notes_ETAB2018-2019.xlsx",sheetIndex = 1,header = TRUE)
## get the number of rows in each of them
nrow(epl_data)
nrow(epl_data2)
nrow(epl_data3)
# I want to rbind the three sets together
eplt<-rbind(epl_data,epl_data2,epl_data3)
The total number of rows is 29441, but when applying rbind to bind them all together I get the error
> eplt<-rbind(epl_data,epl_data2,epl_data3)
Error in match.names(clabs, names(xi)) :
names do not match previous names
but the names of the variables in the 3 sets are the same.
Could someone please help? I only want to rbind 25000 observations and leave the remaining 4441 to compare with the predicted observations of a multiple regression model.
Thanks in advance.
The third data frame doesn't have the same column names as the first two: Svt isn't in upper case (SVT).
One way is to apply the names of one dataframe to the others:
colnames(epl_data2) <- colnames(epl_data)
colnames(epl_data3) <- colnames(epl_data)
But I recommend the janitor package whenever your data comes from Excel files: issues with variable names are common there, and this package ensures consistent formatting of your column names:
epl_data <- janitor::clean_names(epl_data)
epl_data2 <- janitor::clean_names(epl_data2)
epl_data3 <- janitor::clean_names(epl_data3)
Therefore, the rbind should work.
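For completeness, a quick check after cleaning (a sketch; the expected total comes from the question):
eplt <- rbind(epl_data, epl_data2, epl_data3)
nrow(eplt)  # should now be 29441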
As already mentioned, you have a mismatch in the variable name 'SVT'. Here is an alternative that makes all column names lower case and binds the files together into one data frame.
library(dplyr)
library(purrr)
# the pattern matches all three Notes_ETAB<year>-<year>.xlsx files
eplt <- list.files(pattern = 'Notes_ETAB\\d+-\\d+\\.xlsx') %>%
  map_df(~readxl::read_excel(.x) %>% rename_with(~tolower(.)))
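If the 25000/4441 split mentioned in the question simply means the first 25000 rows versus the rest, one way to take it from here might be (a sketch, with illustrative object names):
train   <- eplt[1:25000, ]            # observations used to fit the regression model
holdout <- eplt[25001:nrow(eplt), ]   # remaining 4441 observations kept for comparison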

How can I get the column/variable names of a dataframe that fit certain parameters?

I came across a problem in my DataCamp exercise that basically asked, "Remove the column names in this vector that are not factors." I know what they wanted me to do: simply glimpse(df) and manually delete elements of the vector containing the column names, but that wasn't satisfying for me. I figured there had to be a simple way to store the column names of the dataframe that are factors in a vector. So I tried two things that ended up working, but I worry they might be inefficient.
Example data Frame:
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))
My first solution was this:
vector1 <- names(select_if(df1, is.factor))
This worked, but select_if builds an entire filtered dataframe just so names() can pull out its column names. Surely there's an easier way...
Next, I tried this:
vector2 <- colnames(df1)[sapply(df1,is.factor)]
This also worked, but I wanted to know if there's a quicker, more efficient way of filtering column names based on their type and then storing the results as a vector.
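For comparison, two further base-R variants that do the same thing (a sketch, not from the original exercise):
# Filter() keeps only the columns for which is.factor() is TRUE; names() then extracts their names
vector3 <- names(Filter(is.factor, df1))
# or, type-safe with vapply()
vector4 <- names(df1)[vapply(df1, is.factor, logical(1))]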

How to transpose data in R without losing information?

I am looking to transpose my data set.
Context: QIIME2 outputs tables in which 'Feature.ID' (essentially bacterial species) are rows and sites are columns, with abundances as the cell values. Since many R packages such as Biodiversity require sites as rows and species as columns, I am looking to transpose my data.
library(tidyverse)
Ftable <- read_tsv('table.feature-table_biom.txt', col_names = TRUE, skip = 1) #opens my QIIME2 file, removing the top line
names(Ftable)[names(Ftable)=="#OTU ID"] <- "Feature.ID" #Renames the species column
Ftable <- cbind(Ftable, "observation"=1:nrow(Ftable)) #adds an indexing column 'observation' so that I can later remove the column containing the species names as they are complex and I do not wish to type them out
Ftable <- Ftable %>% select(observation, everything()) #Moves observation to the front
OtuOb <- Ftable
OtuOb <- as.tibble(OtuOb)
write_tsv(OtuOb, "OtuToObservationReference.tsv") #These 3 lines save a reference so I can look to see which observations align to which species
Ftable <- Ftable[,-2] #removes 'Feature.ID' column so that the observation column shows the species
Ftable <- t(Ftable) #I have been trying to get this to work but it doesn't
Ftable <- as.tibble(Ftable)
write_tsv(Ftable, "Ftable.tsv")
When transposing, it removes the sample references (along the top in the original), which means I have no way of seeing how they match up to the abundances of that species.
(A small sample of the data was attached as a screenshot in the original post.)
Before transposing, assign your row names to an object, do the transpose, then assign the stored row names to the new column names:
Ftable_rownames <- rownames(Ftable)
# transpose
Ftable <- t(Ftable)
# assign names from the stored object
colnames(Ftable) <- Ftable_rownames
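One extra detail worth noting (an aside, not part of the answer above): the site labels end up as the row names of the transposed matrix, and as.tibble()/as_tibble() silently drops row names, which is where they disappear in the question's code. Keeping them as an explicit column avoids that, for example:
# sketch: turn the row names (the site labels) of the transposed table into a column
Ftable <- tibble::as_tibble(Ftable, rownames = "Site")
write_tsv(Ftable, "Ftable.tsv")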

Assigning names to rows in R

I would like to assign names to rows in R, but so far I have only found ways to assign names to columns. My data is in two columns, where the first column (geo) contains the name of the specific location I'm investigating and the second column (skada) is the observed value at that location. To clarify, I want to assign names to every location instead of just having them all in one .txt file, so that the data is easier to work with. Does anyone with more experience than me know how to handle this in R?
First you need to import the data into your global environment; try the function read.table().
To name the rows (assuming your data.frame is named df), try:
rownames(df) <- df[, "geo"]
df <- df[, -1]
Well, your question is not that clear...
I assume you are trying to create a data.frame with named rows. If you look at the data.frame help, you can see the description of the row.names parameter:
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
which means you can manually specify the row names when you create the data.frame, or point to the column containing them. The former can be achieved as follows:
d = data.frame(x = rnorm(10),             # 10 random values, normally distributed
               y = rnorm(10),             # 10 random values, normally distributed
               row.names = letters[1:10]  # use the first 10 letters as row names
)
while the latter is
d = data.frame(x = rnorm(10),      # 10 random values, normally distributed
               y = rnorm(10),      # 10 random values, normally distributed
               r = letters[1:10],  # the first 10 letters
               row.names = 3       # the column holding the row names is the 3rd
)
If you are reading the data from a file, I will assume you are using the command read.table. Many of its parameters are the same as those of data.frame; in particular you will find that the row.names parameter works the same way:
a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.
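Applied to the two-column data described in the question, that might look like this (a sketch; the file name and the assumption of a whitespace-separated file with a header row are mine):
# assumed: a .txt file with a header row and columns 'geo' and 'skada'
dat <- read.table("skada.txt", header = TRUE, row.names = "geo")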
Finally, if you have already read in the data.frame and you want to change the row names, Pierre's answer is your solution.

Comparing column headers of two files to fetch data in R

I have a large CSV file, say INPUT, with 500+ columns. I also have a dataframe DF that contains a subset of the column headers of INPUT, and this subset changes at every iteration.
I have to fetch the data from only those columns of INPUT that are present in the dataframe DF and write it into another CSV file, say OUTPUT.
In short,
INPUT.csv:
ID,Col_A,Col_B,Col_C,Col_D,Col_E,Col_F,,,,,,,,,,,,,Col_S,,,,,,,,,,,,,,,,Col_Z
1,009,abcd,67,xvz,33,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,oup,,,,,,,,,,,,,,,,,,90
2,007,efgh,87,wuy,56,67,,,,,,,,,,,,,,,,,,,,,,,,,,,,ghj,,,,,,,,,,,,,,,,,,,,888
print(DF):
[1] "Col_D" "Col_Z"
[3] "Col_F" "Col_S"
OUTPUT.csv
ID,Col_D,Col_Z,Col_F,Col_S
1,xvz,90,50,oup
2,wuy,888,67,ghj
I'm a beginner when it comes to R. I would prefer the matching of the dataframe with the INPUT file to be automated, because I don't want to do this every day when the dataframe gets updated.
I'm not sure whether this is the answer:
input <- read.table(...)
input[colnames(input) %in% colnames(DF)]
If I understand it correctly, you need to import the INPUT.csv file into R and then match the columns of your DF with the columns of your INPUT, is that correct?
You can either use the match function or just import the INPUT.csv file into RStudio via the "Import Dataset" button and subset it. Subsetting of imported dataframes is fairly easy.
If you import your dataset as INPUT, you can subset these columns in the following way: INPUT[, c(1, 2, 4)]
That will get you the first, second and fourth columns of the INPUT dataset.
First, reading the csv is simple:
dataframe_read <- read.csv('/Path/to/csv/')
If I understand correctly that one dataframe's columns are always a subset of the other's, the code is as follows:
### Example dataframes (data_frame() comes from the tibble package)
library(tibble)
df1 <- data_frame(one = c(1, 3, 4), two = c(1, 2, 3), three = c(1, 2, 3))
df2 <- data_frame(one = c(1, 3, 4), three = c(1, 2, 3))
### Make a new dataframe containing only the columns of df2
df3 <- df1[, colnames(df2)]
### Write the new dataframe
write.csv(df3, 'hello.csv')
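Tying this back to the question's files, a possible end-to-end sketch (assuming DF is the character vector of wanted header names that the print() output suggests):
input <- read.csv("INPUT.csv", check.names = FALSE)
# keep the ID column plus every column whose name appears in DF, in DF's order
output <- input[, c("ID", intersect(DF, names(input)))]
write.csv(output, "OUTPUT.csv", row.names = FALSE)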
