Matching and sorting data in R or Excel - r

I have a list of bacteria, each with its own abundance, in a data frame. The same data frame also contains the same list of bacteria in a different order.
I want to match the abundances to this second list, but I'm not sure how to go about it.
dplyr contains several methods for sorting data, but I don't know how to look up each abundance and write it into a new column so that it lines up with the second list of bacteria.
Here's the beginning of my dataset:
Taxon                        Total_abundance  Tips
Acaricomes phytoseiuli       0.000382414      Methanothermobacter thermautotrophicus
Acetivibrio cellulolyticus   0.013979274      Methanobacterium beijingense
Acetobacter aceti            0.181150551      Methanobacterium bryantii
Acetobacter estunensis       0.023074895      Methanosarcina mazei
Acetobacter tropicalis       0.014615221      Persephonella marina
Achromobacter piechaudii     0.031811039      Sulfurihydrogenibium azorense
Achromobacter xylosoxidans   0.041558442      Balnearium lithotrophicum
Acidicapsa borealis          0.035525932      Isosphaera pallida
Acidimicrobium ferrooxidans  0.013841209      Simkania negevensis
Acidiphilium angustum        0.041702984      Parachlamydia acanthamoebae
Acidiphilium cryptum         0.039265944      Leptospira biflexa
Acidiphilium rubrum          0.041702984      Leptospira fainei
...
So, the abundance matches the data in the Taxon column, and I want it to also be matched with the bacteria in the "Tips" column.
For example, Acaricomes phytoseiuli has an abundance of 0.000382414, so in column D 0.000382414 should be printed next to where Acaricomes phytoseiuli appears in Tips. Again, Taxon and Tips contain exactly the same data, just in a different order.
I hope that makes sense.
It doesn't matter if this is done in R or Excel, thanks.

As others have mentioned, it's hard to test without some data that matches, but something like this should work, using match to match up values.
df$D <- df$Total_abundance[ match( df$Tips, df$Taxon ) ]
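To see what that line does, here is a minimal sketch on made-up data. The three names are taken from the question, but the abundance values and the three-row layout are illustrative assumptions. match(df$Tips, df$Taxon) returns, for every entry of Tips, the row position where the same name occurs in Taxon; indexing Total_abundance by those positions reorders it to follow Tips.

```r
# Toy data: same names in Taxon and Tips, different order (values are illustrative)
df <- data.frame(
  Taxon = c("Acetobacter aceti", "Acidicapsa borealis", "Acidiphilium cryptum"),
  Total_abundance = c(0.18, 0.03, 0.04),
  Tips = c("Acidiphilium cryptum", "Acetobacter aceti", "Acidicapsa borealis"),
  stringsAsFactors = FALSE
)

# Position of each Tips entry within Taxon
idx <- match(df$Tips, df$Taxon)

# Abundances reordered to line up with Tips; a name with no match would give NA
df$D <- df$Total_abundance[idx]
print(df)
```

If Taxon contained duplicate names, match would only return the first position for each, so this relies on the names being unique.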

I assume that your list of bacteria is unique.
As a sample data frame:
dff <- data.frame(bacteria1=letters[1:10], abundance1=runif(10,0,1),
                  bacteri2=sample(letters[1:10],10), abundance2=0)
Now we find the matching bacteria rows and insert the abundances:
for(i in seq_len(nrow(dff))){
  # row of bacteria1 that holds the same name as bacteri2[i]
  s <- which(dff$bacteri2[i]==dff$bacteria1)
  dff$abundance2[i] <- dff$abundance1[s]
}

In Excel, in column D, you can do the following:
=VLOOKUP(C3;$A$3:$B$13;2;FALSE)
C3 is the Tip and $A$3:$B$13 is the range that is searched, with A holding the bacteria name and B the abundance. If a match is found, the corresponding abundance is returned. The $ signs keep the range fixed when the formula is filled down.
If you get an error like #N/A, then there is no match. You can also avoid these errors by using this formula:
=IFNA(VLOOKUP(C3;$A$3:$B$13;2;FALSE);"No match")
Edit: Adjust the ranges to your file!
Edit 2: Keep in mind that the separator I use is ; and your Excel might use the comma , separator.

First of all, if your Taxon and Tips columns contain exactly the same data, only in different order, they have no place being together in the same data frame. You should either have two data frames, or come up with some sort of key to define the place of a Taxon item in the phylogenetic tree and then re-sort the data frame as needed, either in alphabetic order or by phylogeny.
As a quick solution, I would first extract the Tips column in a separate data frame, join it with the original data frame by the Tips and Taxon columns, thus obtaining the correct order of abundance values in the new data frame and (if you still insist) using cbind to glue the newly re-sorted abundance column back into the original data frame. Like so, assuming you're using dplyr (df is a dummy stand-in for your data set):
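library(dplyr)
df <- data.frame(Taxon=c("a","b","c","d","e"), Abundance=c(1:5), Tips=c("b","a","d","c","e"))
# keep only the Tips column, in its original order
new_df <- select(df, Tips)
# look up each Tip among the Taxon values to pull in its abundance
new_df <- left_join(new_df, df, by=c("Tips"= "Taxon"))
df <- cbind(df, New_Abund=new_df$Abundance)
rm(new_df)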

Related

Changing columns name of a df in only one step

I just created a data frame from a data set (let's call it "rf1") today.
Here is the df:
df<- matrix(0,1052,156)
As you can see, the 156 columns have no names.
I would like to know if there is a way to set all the column names of my df to the names of rf1, which has exactly the same number of columns and rows...
I know the old way, but I don't want to spend 3 hours typing all the column names into a vector...
And the same question for the rows!
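One possible approach, sketched on small stand-in objects: dimnames() carries both the row and the column names, so assigning it copies both in one step. Below, rf1 is a hypothetical 3 x 2 data frame standing in for the real one, and df is a zero matrix of the same shape, as in the question.

```r
# Hypothetical stand-in for rf1 (the real one is 1052 x 156)
rf1 <- data.frame(height = c(1, 2, 3), weight = c(4, 5, 6),
                  row.names = c("s1", "s2", "s3"))

# Zero matrix with the same dimensions, as in the question
df <- matrix(0, nrow(rf1), ncol(rf1))

# Copy row and column names in one assignment
dimnames(df) <- dimnames(rf1)

# Or copy them separately
colnames(df) <- colnames(rf1)
rownames(df) <- rownames(rf1)
print(df)
```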

How do I show which variables are not shared by two datasets in R?

I have two data sets (A and B), one with 1600 observations/rows and 1002 variables/columns and one with 860 observations/rows and 1040 variables/columns. I want to quickly check which variables are not contained in dataset A but are in dataset B, and vice versa. I am only interested in the column names, not in the observations contained within these columns.
I found this great function here: https://cran.r-project.org/web/packages/arsenal/vignettes/comparedf.html and essentially I would want to get an output similar to this:
The code I am trying is: summary(comparedf(dataA, dataB)) However, the table is not printed because R does a row-by-row comparison of both data sets and then runs out of space when printing the results in the console. Is there a quick way of achieving what I need here?
I think you can use the anti_join() function from the dplyr package to find the unmatched records. It returns the rows of the first data set that have no match in the second. Here is an example:
table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet",
"lion", "duck"))
table2<-table1[c(1,3,5),]
library(dplyr)
anti_join(table1, table2, by="id")
  id animal
1  2    dog
2  4   lion
This will return the unshared rows by ID.
Edit
If you want to find which column names/variables appear in one data frame but not the other, then you could use this solution:
df1 <- data.frame(a=rnorm(100), b=rnorm(100), not=rnorm(100))
df2 <- data.frame(a=rnorm(100), b=rnorm(100))
df1[, !names(df1) %in% names(df2), drop = FALSE] # returns the columns/variables that appear in df1 but not in df2
I hope this answers your question. It will return the actual values beneath each unshared column/variable, but you could save the output to an object and run colnames() on it, which should print your unshared column/variable names. The drop = FALSE keeps the result a data frame even when only one column is unshared, so that colnames() still works.
It may be a bit clunky, but combining setdiff() with colnames() may work.
Doing both setdiff(colnames(DataA),colnames(DataB)) and setdiff(colnames(DataB),colnames(DataA)) will give you 2 vectors, each with the names of the columns present in one of the datasets but not in the other one.
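A minimal sketch of the setdiff() approach on two toy data frames (the column names and values are illustrative, not the real data sets):

```r
# Toy data frames standing in for the two data sets
dataA <- data.frame(id = 1:3, age = c(30, 41, 25), height = c(170, 165, 180))
dataB <- data.frame(id = 1:2, age = c(52, 38), weight = c(70, 82))

# Column names present in one data set but not in the other
only_in_A <- setdiff(colnames(dataA), colnames(dataB))
only_in_B <- setdiff(colnames(dataB), colnames(dataA))

print(only_in_A)
print(only_in_B)
```

Because only the name vectors are compared, the number of rows in each data set does not matter.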

R - Removing rows in data frame by list of column values

I have two data frames, one containing the predictors and one containing the different categories I want to predict. Both of the data frames contain a column named geoid. Some of the rows of my predictors contains NA values, and I need to remove these.
After extracting the geoid value of the rows containing NA values, and removing them from the predictors data frame I need to remove the corresponding rows from the categories data frame as well.
It seems like a rather basic operation but the code won't work.
categories <- as.data.frame(read.csv("files/cat_df.csv"))
predictors <- as.data.frame(read.csv("files/radius_100.csv"))
NA_rows <- predictors[!complete.cases(predictors),]
geoids <- NA_rows['geoid']
clean_categories <- categories[!(categories$geoid %in% geoids),]
None of the rows in categories/clean_categories are removed.
A typical geoid value is US06140231. typeof(categories$geoid) returns integer.
I can't say for sure that this is the problem, but a very basic typo would stop the line from doing what you want. Try this correction:
clean_categories <- categories[!(categories$geoid %in% geoids),]
Almost certainly this is what you meant to happen in that line. You want to negate the result of the %in% operator. You don't include a reproducible example so I can't say whether the whole thing will do as you want.
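Another thing that may be worth checking, independent of the negation: NA_rows['geoid'] returns a one-column data frame rather than a vector, so %in% is not comparing against the individual geoid values. A minimal sketch with made-up data (the geoid values and the other columns are illustrative assumptions) that extracts the column as a vector instead:

```r
# Illustrative stand-ins for the two CSV files
predictors <- data.frame(geoid = c("US01", "US02", "US03", "US04"),
                         x = c(1.5, NA, 3.2, NA),
                         stringsAsFactors = FALSE)
categories <- data.frame(geoid = c("US01", "US02", "US03", "US04"),
                         category = c("a", "b", "a", "c"),
                         stringsAsFactors = FALSE)

NA_rows <- predictors[!complete.cases(predictors), ]

# `$` (or `[[ ]]`) returns a plain vector; NA_rows['geoid'] would return a data frame
geoids <- NA_rows$geoid

clean_predictors <- predictors[complete.cases(predictors), ]
clean_categories <- categories[!(categories$geoid %in% geoids), ]
print(clean_categories)
```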

Transpose AND Stack only specified rows to columns in R

This question is quite difficult to describe, but easy to understand when visualized. I would therefore suggest looking at the two images that I linked to this post to help facilitate understanding the issue.
Here is a link to my practice data frame:
sample.data <-read.table("https://pastebin.com/uAQD6nnM", header=T, sep="\t")
I don't know why I get the error "more columns than column names", because reading the same file from my desktop works just fine, and clicking on the link does go to my dataset.
I received very large data frames that are arranged in rows, and I want the data to be put in columns. However, it is not that 'easy', because I do not necessarily want (or need) to transpose all the data.
This link appears to be close to what I would like to do, but it is not quite the right answer for me: Python Pandas: Transpose or Stack?
I have a header with GPS data (Coords_Y, Coords_X), followed by a list of 100+ plant species names. If a species is present at a certain location, the author used the term TRUE, and if not present, they used the term FALSE.
I would like to take this data set I've been sent and create a new column called "species" that stacks the species names on top of each other and keeps only the entries set to TRUE. Therefore, as my images point out, if 2 plants are both present at the same location, then the GPS points will need to be duplicated so no data point is lost, and at the same time, if a certain species is present at many locations, the species name will need to be repeated multiple times in the column. In the end, I will have a dataset that is thousands of rows long, but has only 5 columns in my header row.
Before
After
Here is a way to do it using base R:
# Notice that the link works if you include the /raw/ part
sample.data <-read.table("https://pastebin.com/raw/uAQD6nnM", header=T, sep="\t")
vars <- c("var0", "Var.1", "Coords_y", "Coords_x")
# Just selects the ones marked TRUE for each
alf <- sample.data[ sample.data$Alfaroa.williamsii, vars ]
aln <- sample.data[ sample.data$Alnus.acuminata, vars ]
alf$species <- "Alfaroa.williamsii"
aln$species <- "Alnus.acuminata"
final <- rbind(alf,aln)
final
var0 Var.1  Coords_y  Coords_x            species
 192   191   7.10000 -73.00000 Alfaroa.williamsii
 101   100 -13.18000 -71.59000 Alfaroa.williamsii
  36    35  10.18234 -84.10683    Alnus.acuminata
  38    37  10.26787 -84.05528    Alnus.acuminata
To do it more generally, using dplyr and tidyr, you can use the gather function:
library(dplyr)
library(tidyr)
tidyr::gather(sample.data, key = "species", value = "keep", 5:6) %>%
dplyr::filter(keep) %>%
dplyr::select(-keep)
Just replace the 5:6 with the indices of the columns of different species.
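gather() still works, but it is marked as superseded in current tidyr releases, and pivot_longer() is its replacement. A sketch of the same reshaping with pivot_longer(), using a small made-up data frame (the column names and values are illustrative stand-ins, not the original data):

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for sample.data: two coordinate columns, two species columns
sample.data <- data.frame(
  Coords_y = c(7.1, 10.2, -13.2),
  Coords_x = c(-73.0, -84.1, -71.6),
  Alfaroa.williamsii = c(TRUE, FALSE, TRUE),
  Alnus.acuminata = c(TRUE, TRUE, FALSE)
)

long <- sample.data %>%
  pivot_longer(cols = c(Alfaroa.williamsii, Alnus.acuminata),
               names_to = "species", values_to = "keep") %>%
  filter(keep) %>%   # keep only locations where the species is present
  select(-keep)      # the logical column is no longer needed
print(long)
```

With 100+ species columns, the cols argument could instead exclude the non-species columns, for example cols = -c(Coords_y, Coords_x).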
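I could not download the data, so I made some:
library(dplyr)
sample.data <- data.frame(var0 = c(192, 36, 38, 101), var1 = c(191, 35, 37, 100),
                          y = c(7.1, 10.1, 10.2, -13.8), x = c(-73, -84, -84, -71),
                          Alfaroa = c(TRUE, FALSE, FALSE, TRUE), Alnus = c(TRUE, TRUE, TRUE, FALSE))
The code that gives the requested result is:
# For each species: keep rows where it is present, drop the other species column,
# then turn the logical column into a Species column holding the species name
dfAlfaroa <- sample.data %>% filter(Alfaroa) %>% select(-Alnus) %>% rename("Species" = "Alfaroa") %>% replace("Species", "Alfaroa")
dfAlnus <- sample.data %>% filter(Alnus) %>% select(-Alfaroa) %>% rename("Species" = "Alnus") %>% replace("Species", "Alnus")
rbind(dfAlfaroa, dfAlnus)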

Using the split function to group a dataframe by factor, alternatives for large dataframes

I have a question regarding using the split function to group data by factor.
I have a data frame of two columns snps and gene. Snps is a factor, gene is a character vector. I want to group genes by the snp factor so I can see a list of genes mapping to each snp. Some snps may map to more than one gene, for example rs10000226 maps to gene 345274 and gene 5783, and genes occur multiple times.
To do this I used the split function to make a list of genes each snp maps to.
snps<-c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene<-c("5783", "171425", "345274", "5783")
df<-data.frame(snps, gene) # snps is a factor
df$gene<-as.character(df$gene)
splitted=split(df, df$gene, drop=T) # group by gene
snpnames=unique(df$snps)
df.2<-lapply(splitted, function(x) { x["snps"] <- NULL; x }) # remove the snp column
names(df.2)=snpnames # rename the list elements by snp
df.2 = sapply(df.2, function(x) list(as.character(x$gene)))
save(df.2, file="df.2.rda")
However, this is not effective for my full dataframe (probably due to its size: 363422 rows, 281370 unique snps, 20888 unique genes) and R crashes whilst trying to load df.2.rda later on.
Any suggestions for alternative ways to do this would be much appreciated!
There is a shorter way to create your df.2:
genes_by_snp <- split(df$gene, df$snps)
You can look at the genes for a given snp with genes_by_snp[["rs10000226"]].
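As a quick check, here is that one-liner run on the four-row example from the question. Only the stringsAsFactors = FALSE argument is added here, so that gene stays a character vector without the extra conversion step.

```r
snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
df <- data.frame(snps, gene, stringsAsFactors = FALSE)

# One list element per snp, each holding a character vector of genes
genes_by_snp <- split(df$gene, df$snps)

print(genes_by_snp[["rs10000226"]])
print(lengths(genes_by_snp))
```

The result is a plain named list of character vectors, which is a much lighter structure than a list of data frames.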
Your data set does not sound so big to me, but you could avoid creating the list above by storing your original data differently. Expanding on @AnandoMahto's comment, here's how to use the data.table package:
require(data.table)
setDT(df)
setkey(df,snps)
You can look at the genes for a given snp with df[J("rs10000226")].
