How to transpose data in R without losing information?

I am looking to transpose my data set.
Context: QIIME2 outputs a table in which 'Feature.ID' (essentially bacterial species) are rows and sites are columns, with abundance as the cell values. Many R packages, such as Biodiversity, require sites as rows and species as columns, so I need to transpose my data.
library(tidyverse)
Ftable <- read_tsv('table.feature-table_biom.txt', col_names = TRUE, skip = 1) #opens my QIIME2 file, removing the top line
names(Ftable)[names(Ftable)=="#OTU ID"] <- "Feature.ID" #Renames the species column
Ftable <- cbind(Ftable, "observation"=1:nrow(Ftable)) #adds an indexing column 'observation' so that I can later remove the column containing the species names as they are complex and I do not wish to type them out
Ftable <- Ftable %>% select(observation, everything()) #Moves observation to the front
OtuOb <- Ftable
OtuOb <- as.tibble(OtuOb)
write_tsv(OtuOb, "OtuToObservationReference.tsv") #These 3 lines save a reference so I can look to see which observations align to which species
Ftable <- Ftable[,-2] #removes 'Feature.ID' column so that the observation column shows the species
Ftable <- t(Ftable) #I have been trying to get this to work but it doesn't
Ftable <- as.tibble(Ftable)
write_tsv(Ftable, "Ftable.tsv")
Transposing removes the sample references (along the top in the original), which means I have no way of seeing how they match up to the abundances of each species.

Before transposing, store the row names in an object, do the transpose, then assign the stored row names as the new column names:
Ftable_rownames <- rownames(Ftable)
# transpose
Ftable <- t(Ftable)
# assign names from the stored object
colnames(Ftable) <- Ftable_rownames
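One likely reason the sample names vanish in the question's pipeline is that t() returns a matrix that keeps the old column names as row names, and converting to a tibble or writing with write_tsv() discards row names. A minimal sketch of a workaround, assuming a small made-up table (the values and the Site1/Site2/otu_* names are hypothetical, not from the QIIME2 file), moves the sample names into an ordinary column before export:

```r
# Hypothetical stand-in for the QIIME2 table: features as rows, samples as columns
Ftable <- data.frame(
  Feature.ID = c("otu_a", "otu_b", "otu_c"),
  Site1 = c(10, 0, 3),
  Site2 = c(2, 5, 0),
  stringsAsFactors = FALSE
)

# Transpose only the numeric abundance columns so the matrix stays numeric
abund <- t(as.matrix(Ftable[, -1]))

# Use the feature IDs as column names of the transposed matrix
colnames(abund) <- Ftable$Feature.ID

# Copy the sample names (now row names) into a regular column,
# so they are still present after export
Ftable_t <- data.frame(
  Sample = rownames(abund),
  abund,
  row.names = NULL,
  check.names = FALSE,
  stringsAsFactors = FALSE
)
```

Ftable_t can then be passed to write_tsv() or write.table(..., row.names = FALSE) with the sample names intact.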

Related

How to remove columns in a dataframe in R based on the values from another vector?

I have a vector of values that I would like to use to delete the matching columns from a data frame.
For example, if my data frame has columns A,B,C,D,E,F,G,H
and my vector has the values C,E,H,
I would like my data frame to have columns
A,B,D,F,G
There are several options. To remove the columns from the original dataset, assigning NULL is quick:
df1[vecofnames] <- NULL
Another option, if we want to store the subset in a different object:
df2 <- df1[setdiff(names(df1), vecofnames)]
Or with subset (a character vector cannot be negated with -, so select the names to keep):
df2 <- subset(df1, select = setdiff(names(df1), vecofnames))
Or in dplyr:
library(dplyr)
df2 <- df1 %>%
select(-all_of(vecofnames))
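As a small base R check of the approaches above, the sketch below builds a toy data frame with columns A to H (the values are made up for illustration) and drops C, E and H in three ways:

```r
# Toy data frame with columns A..H (values are arbitrary)
df1 <- as.data.frame(matrix(1:16, nrow = 2, dimnames = list(NULL, LETTERS[1:8])))
vecofnames <- c("C", "E", "H")

# Keep everything that is not listed in vecofnames, in a new object
df2 <- df1[setdiff(names(df1), vecofnames)]

# Same result with subset(), selecting the names to keep
df3 <- subset(df1, select = setdiff(names(df1), vecofnames))

# Remove the columns from the original data frame in place
df1[vecofnames] <- NULL
```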

R How to match dataframes to retrieve elements

I have a dataframe with 5000 rows, containing municipalities data from which I need to extract only rows matching a specific set of names. I am iterating the set through my dataframe using for loop.
This is for R 3.6.0
data <- NULL
for (i in mun.names){
data <- area.mun[area.mun[, 1] == i, ]
}
The object mun.names contains the municipalities I need to match. The object area.mun has two columns, NAME and AREA. The first column of both objects holds the municipality names, formatted consistently.
At the end of the for loop my resulting object data always has only one value, the last municipality of the object area.mun.
This is a simple error. I appreciate any kind of feedback.
Convert your 'mun.names' to data frame:
mun.names <- data.frame(mun.names)
Change the column name to 'NAME' (note the quotes):
colnames(mun.names) <- c("NAME")
Convert your 'area.mun' to data frame:
area.mun <- data.frame(area.mun)
Use merge command to extract the matched rows:
df <- merge(area.mun,mun.names,by.x="NAME",by.y="NAME")
You can also keep all the unmatched rows from the mun.names and area.mun data frames by using all.x=TRUE and all.y=TRUE:
df <- merge(area.mun,mun.names,by.x="NAME",by.y="NAME",all.x=TRUE, all.y=TRUE)
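The loop in the question keeps only the last match because data is overwritten on every iteration. A minimal sketch of two alternatives follows, assuming mun.names is a plain character vector and using made-up municipality names and areas: a vectorised %in% filter, and the original loop repaired to accumulate rows.

```r
# Hypothetical example data; the real area.mun has about 5000 rows
area.mun <- data.frame(
  NAME = c("Alpha", "Beta", "Gamma", "Delta"),
  AREA = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)
mun.names <- c("Beta", "Delta")

# Vectorised alternative: keep rows whose NAME appears in mun.names
matched <- area.mun[area.mun$NAME %in% mun.names, ]

# Loop version, fixed to append each match instead of overwriting
data <- NULL
for (i in mun.names) {
  data <- rbind(data, area.mun[area.mun[, 1] == i, ])
}
```

Note that the %in% filter returns rows in the order of area.mun, while the loop returns them in the order of mun.names.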

R: row.names and data manipulation / export

I am having trouble understanding what row.names is and how it works, and how I can get my own data to do what row.names allows.
For example, I am creating some clusters with the code below (my data). I want to export the results; the sapply line does this, but only prints to the screen for now. The first column of my data frame (path_country) holds country names and the other columns are other variables (integers). I don't see an easy way to export these clusters to a table or list of countries and their group membership.
I tried to make a dummy example using example data sets in R. For example, mtcars, it was then that I noticed the first column was denoted as row.names. With mtcars I can create clusters, cutree to the specified number of groups and then save as a data frame. With this approach I have the 'car names' in the first column and the group number in the second column (more or less, could be cleaned up to look nicer, but is essentially what I am after), which is what I would like to happen with my data.
Any thoughts on this would be appreciated.
# my data
path_country <- read.csv("C:/path_country.csv")
patho <- subset(path_country, select=c(2:188))
patho.d <- dist(patho)
patho.hclust <- hclust(patho.d)
patho.hclust.groups11 = cutree(patho.hclust,11)
sapply(unique(patho.hclust.groups11),function(g)path_country$Country[patho.hclust.groups11 == g])
# mtcars data
car.d <- dist(mtcars)
car.h <- hclust(car.d)
car.h.11 <- cutree(car.h, 11)
nice_result <- as.data.frame(car.h.11)
write.table(nice_result, "test.txt", sep="\t")
1) You can create a data.frame with row.names directly from the CSV file:
# Names in the first column
path_country <- read.table("C:/path_country.csv", sep=",", row.names=1)
# Names in column "Country"
path_country <- read.table("C:/path_country.csv", sep=",", row.names="Country", header=TRUE)
Note that in the second case you must specify header=TRUE so that the column names can be used.
Now rownames(path_country) should return a vector of row names, and as.data.frame(patho.hclust.groups11) gives a nice result for export.
2) At any time you can set the row names of your data.frame with:
rownames(path_country) <- names.vector
where names.vector is a vector of unique names whose length equals the number of rows in the data.frame. The result of cutree is a plain vector rather than a data.frame, so in your example set its names instead:
names(patho.hclust.groups11) <- path_country$Country
Note that if you use the first approach you don't need this command.
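To illustrate how row names carry through the clustering, here is a sketch using the built-in mtcars data. dist() keeps the row names as labels, so cutree() returns a named vector that can be turned into a two-column table of name and cluster. The output file name and the membership/by_cluster object names are arbitrary choices for this example.

```r
# Row names of mtcars become the labels of the distance object
car.h <- hclust(dist(mtcars))

# Named integer vector: names are car names, values are cluster numbers
groups <- cutree(car.h, k = 11)

# Two explicit columns instead of relying on row names
membership <- data.frame(
  name = names(groups),
  cluster = unname(groups),
  stringsAsFactors = FALSE
)

# List of names per cluster, similar to the sapply() call in the question
by_cluster <- split(membership$name, membership$cluster)

# Export to a tab-separated file (written to a temporary directory here)
out_file <- file.path(tempdir(), "clusters.txt")
write.table(membership, out_file, sep = "\t", row.names = FALSE, quote = FALSE)
```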

R: Constructing a table from three dataframes of equal columns but different names

I have three dataframes of variable row length:
df1 (column names a,b,c)
df2 (column names d,e,f)
df3 (column names g,h,i)
How can I combine them into one table
(one dataframe under the other)
table.all <- rbind(df1,df2,df3)
only works for same column names but my column names are different.
Then save this table to a csv:
write.csv(table.all, "table.all.csv")
Make sure corresponding columns have the same data type, otherwise you may get errors or unwanted coercion. If your data frames have the same structure, one solution is:
df1 <- data.frame(a=1,b="a",c=3)
df2 <- data.frame(d=2,e="a",f=3)
df3 <- data.frame(g=3,h="a",i=3)
library(plyr)
ll <- list(df1,df2,df3)
ldply(ll, function(l) {
  names(l) <- c("col1", "col2", "col3")
  l
})
And this will work with data frames with different number of rows as well.
Well you'll have to decide at some point what the column names should be for your final data.frame. Why not set all the column names to what you want them to be then rbind?
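The rename-then-rbind suggestion can be sketched in base R without plyr. The example data and the common names col1, col2, col3 are placeholders chosen for illustration; the data frames deliberately have different numbers of rows.

```r
# Example data frames with different column names and row counts
df1 <- data.frame(a = 1:2, b = c("x", "y"), c = c(3, 4), stringsAsFactors = FALSE)
df2 <- data.frame(d = 5L, e = "z", f = 6, stringsAsFactors = FALSE)
df3 <- data.frame(g = 7:9, h = c("u", "v", "w"), i = c(1, 2, 3), stringsAsFactors = FALSE)

# Column names to use in the combined table
common <- c("col1", "col2", "col3")

# Rename each data frame, then stack them one under the other
table.all <- do.call(rbind, lapply(list(df1, df2, df3), setNames, common))
```

The result can then be saved with write.csv(table.all, "table.all.csv", row.names = FALSE).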

Basic R - how to exclude rows with blank columns, how to show data for specific column values

Two questions about R:
1) If I have a data set with multiple columns, one of which is 'test_score', how can I exclude the rows with blank (and/or non-numeric) values in that column (when using pie(), hist(), or cor())?
2) If the data set has a column named 'Teachers', how might I graph the 'test_score' column only for the rows where the teacher is Jones?
Creating separate vectors without the missing data:
dat.nomissing <- tenthgrade[!is.nan(Score),]
seems problematic as the two columns must remain paired.
I was thinking something such as:
hist(!is.nan(tenthgrade$Score)[tenthgrade$Teacher=='Jones'])
However, is.nan is creating a list of TRUE, FALSE values (as it should).
Use subscripting. For example:
dat[!is.na(dat$test_score),]
hist(dat$test_score[dat$Teachers=='Jones'])
And a more complete example with artificial data:
# Create artificial dataset
dat <- data.frame('test_score'=rnorm(500), 'Teachers'=sample(c('Jones', 'Smith', 'Clark'), 500, replace=TRUE))
# Introduce some random missingness
dat$test_score[sample(1:500, 50)] <- NA
# Keep if test_score is valid
dat.nomissing <- dat[!is.na(dat$test_score),]
# Plot subset of data
hist(dat.nomissing$test_score[dat.nomissing$Teachers=='Jones'])
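The question also mentions non-numeric values. A possible sketch for that case, assuming the scores were read in as text (the example values below are made up), converts the column with as.numeric() and drops every row that fails to convert, so the score and teacher columns stay paired:

```r
# Hypothetical data where test_score was read as character
dat <- data.frame(
  test_score = c("90", "", "abc", "75.5", NA),
  Teachers = c("Jones", "Jones", "Smith", "Jones", "Smith"),
  stringsAsFactors = FALSE
)

# Blank and non-numeric entries become NA; the coercion warning is expected
score_num <- suppressWarnings(as.numeric(dat$test_score))

# Drop whole rows so scores and teachers remain aligned
clean <- dat[!is.na(score_num), ]
clean$test_score <- score_num[!is.na(score_num)]

# Scores for one teacher only
jones_scores <- clean$test_score[clean$Teachers == "Jones"]
# hist(jones_scores)  # uncomment to plot
```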
