Transpose AND Stack only specified rows to columns in R - r

This question is quite difficult to describe, but easy to understand when visualized. I would therefore suggest looking at the two images that I linked to this post to help facilitate understanding the issue.
Here is a link to my practice data frame:
sample.data <-read.table("https://pastebin.com/uAQD6nnM", header=T, sep="\t")
I don't know why I get an error "more columns than column names", because using the same file from my desktop works just fine, however clicking on the link goes to my dataset.
I received very large data frames that are arranged in rows, and I want to put the data in columns. However, it is not that 'easy', because I do not necessarily want (or need) to transpose all the data.
This link appears to be close to what I would like to do, but it is not quite the right answer for me: "Python Pandas: Transpose or Stack?"
I have a header with GPS data (Coords_Y, Coords_X), followed by a list of 100+ plant species names. If a species is present at a certain location, the author used the term TRUE, and if not present, they used the term FALSE.
I would like to take this data set I've been sent and create a new column called "species" that stacks the species (currently listed as separate columns) on top of each other and keeps only the data set to TRUE. Therefore, as my images point out, if 2 plants are both present at the same location, then the GPS points will need to be duplicated so no data point is lost, and at the same time, if a certain species is present at many locations, the species name will need to be repeated multiple times in the column. In the end, I will have a dataset that is 1000's of rows long, but only 5 columns in my header row.
Before
After

Here is a way to do it using base R:
# Notice that the link works if you include the /raw/ part
sample.data <-read.table("https://pastebin.com/raw/uAQD6nnM", header=T, sep="\t")
vars <- c("var0", "Var.1", "Coords_y", "Coords_x")
# Just selects the ones marked TRUE for each
alf <- sample.data[ sample.data$Alfaroa.williamsii, vars ]
aln <- sample.data[ sample.data$Alnus.acuminata, vars ]
alf$species <- "Alfaroa.williamsii"
aln$species <- "Alnus.acuminata"
final <- rbind(alf,aln)
final
 var0 Var.1  Coords_y  Coords_x            species
  192   191   7.10000 -73.00000 Alfaroa.williamsii
  101   100 -13.18000 -71.59000 Alfaroa.williamsii
   36    35  10.18234 -84.10683    Alnus.acuminata
   38    37  10.26787 -84.05528    Alnus.acuminata
To do it more generally, using dplyr and tidyr, you can use the gather function:
library(dplyr)
library(tidyr)
tidyr::gather(sample.data, key = "species", value = "keep", 5:6) %>%
dplyr::filter(keep) %>%
dplyr::select(-keep)
Just replace the 5:6 with the indices of the columns of different species.
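The hand-written base R version above can also be generalized to any number of species columns without extra packages. The following is only a sketch under assumptions: the toy data frame and its values are made up for illustration, and it assumes every column that is not an identifier column is a logical species column.

```r
# Toy data standing in for the real file (values are invented for illustration)
sample.data <- data.frame(
  var0 = c(1, 2, 3),
  Coords_y = c(7.1, 10.2, -13.2),
  Coords_x = c(-73.0, -84.1, -71.6),
  Alfaroa.williamsii = c(TRUE, FALSE, TRUE),
  Alnus.acuminata = c(TRUE, TRUE, FALSE)
)

# Assumption: everything that is not an id column is a species column
id_cols <- c("var0", "Coords_y", "Coords_x")
species_cols <- setdiff(names(sample.data), id_cols)

# For each species column, keep the rows marked TRUE and label them
pieces <- lapply(species_cols, function(sp) {
  present <- sample.data[sample.data[[sp]] %in% TRUE, id_cols, drop = FALSE]
  present$species <- rep(sp, nrow(present))  # rep() also copes with zero rows
  present
})

# Stack the per-species pieces on top of each other
long <- do.call(rbind, pieces)
rownames(long) <- NULL
long
```

Using `%in% TRUE` rather than the bare logical column means rows with a missing value are dropped instead of turning into rows of NA.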

I could not download the data so I made some:
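library(dplyr)
sample.data <- data.frame(var0 = c(192, 36, 38, 101), var1 = c(191, 35, 37, 100),
                          y = c(7.1, 10.1, 10.2, -13.8), x = c(-73, -84, -84, -71),
                          Alfaroa = c(TRUE, FALSE, FALSE, TRUE), Alnus = c(TRUE, TRUE, TRUE, FALSE))
The code that gives the requested result is:
# Keep the TRUE rows for one species, drop the other species column,
# then turn the logical column into a "Species" label
dfAlfaroa <- sample.data %>% filter(Alfaroa) %>% select(-Alnus) %>% rename(Species = Alfaroa) %>% mutate(Species = "Alfaroa")
dfAlnus <- sample.data %>% filter(Alnus) %>% select(-Alfaroa) %>% rename(Species = Alnus) %>% mutate(Species = "Alnus")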
rbind(dfAlfaroa,dfAlnus)

Related

How to deal with a factor that's a proportion in R?

I am dealing with a dataset that has confirmation margins for individuals confirmed in the US Senate. The dataset I'm pulling from has them coded as factors with each possible margin (100/0, 50/50, etc) as a different level. I need to assign the values of these margins to a column in another data frame. Right now my code looks something like:
for (i in fedjud_scotus$Judge.Name) {
justice_data$confirm_margin[justice_data$justice==i] <- fedjud_scotus$Ayes.Nays[fedjud_scotus$Judge.Name==i]
}
where fedjud_scotus is the original data frame, and justice_data is the new data frame I'm trying to add confirmation data into. Right now, this is only moving the level (ex. 3,4,8), not the actual margin (64/36, 93/7, etc). Is there a way to get the actual margin data to move where I want it?
Without a reproducible example, it is difficult to know exactly what you are looking for. I have taken a guess below.
Once you convert the factors to strings, you can do this with string manipulation. The strsplit function splits a string into parts, but it does not play nicely with dplyr.
This question and its answers provide several approaches that use the same idea but are implemented in a way that works well with dplyr.
Example solution:
library(dplyr)
library(tidyr)
df <- data.frame(proportions = c("100/0", "70/30", "50/50"),
                 stringsAsFactors = TRUE)
split <- df %>%
  mutate(proportions = as.character(proportions)) %>%
  separate(proportions, c("win", "loss"), "/")
output <- cbind(df, split) %>%
  mutate(win = as.numeric(win),
         loss = as.numeric(loss))
Gives:
proportions win loss
1 100/0 100 0
2 70/30 70 30
3 50/50 50 50
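If the goal is only to copy the margin text into the other data frame, converting the factor with as.character() before assigning should be enough, because assigning a factor into a non-factor column carries over the integer level codes. A minimal sketch, assuming made-up miniature versions of the two data frames from the question (the judge names and margins here are invented):

```r
# Hypothetical stand-ins for the two data frames in the question
fedjud_scotus <- data.frame(
  Judge.Name = c("A", "B", "C"),
  Ayes.Nays = factor(c("64/36", "93/7", "100/0"))
)
justice_data <- data.frame(justice = c("C", "A", "B"))

# match() finds, for each justice, the row of that judge in fedjud_scotus;
# as.character() yields the factor labels rather than the integer codes
idx <- match(justice_data$justice, fedjud_scotus$Judge.Name)
justice_data$confirm_margin <- as.character(fedjud_scotus$Ayes.Nays)[idx]
justice_data
```

This also replaces the for loop with a single vectorized lookup; justices with no match would get NA.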

rbind three data bases using Rbind function

I know it's a newbie question. I have 3 xlsx files containing three databases with the same 14 variables; it is cross-sectional panel data.
All I want is to concatenate them into one single database called eplt.
First, I import them
library(dplyr)
library(ggplot2)
library(xlsx)
##Import the three data bases
epl_data<-read.xlsx("Notes_ETAB2016-2017.xlsx",sheetIndex = 1,header = TRUE)
epl_data2<-read.xlsx("Notes_ETAB2017-2018.xlsx",sheetIndex = 1,header = TRUE)
epl_data3<-read.xlsx("Notes_ETAB2018-2019.xlsx",sheetIndex = 1,header = TRUE)
## to render the number of rows in each of them
nrow(epl_data)
nrow(epl_data2)
nrow(epl_data3)
# I want to rbind the three sets together
eplt<-rbind(epl_data,epl_data2,epl_data3)
The total number of rows is 29441, but when applying rbind to bind them all together I get the error
> eplt<-rbind(epl_data,epl_data2,epl_data3)
Error in match.names(clabs, names(xi)) :
names do not match previous names
but the names of the variables in the 3 sets are the same.
Could someone please help? I only want to rbind 25000 observations and keep the remaining 4441 to compare with the predicted observations of a multiple regression model.
Thanks in advance
The third data frame doesn't have the same names as the first two: Svt isn't in upper case.
One way is to apply the names of one dataframe to the others:
colnames(epl_data2) <- colnames(epl_data)
colnames(epl_data3) <- colnames(epl_data)
But I recommend the package janitor whenever your data comes from Excel files, since variable name issues are common there. This package ensures consistent formatting of your column names:
epl_data <- janitor::clean_names(epl_data)
epl_data2 <- janitor::clean_names(epl_data2)
epl_data3 <- janitor::clean_names(epl_data3)
After that, the rbind should work.
As already mentioned you have a mismatch in the variable name 'SVT'. Here is an alternative that would make the column names lower case and bind them together in one dataframe.
library(dplyr)
library(purrr)
# \\d+-\\d+ matches every year range (2016-2017, 2017-2018, 2018-2019)
eplt <- list.files(pattern = 'Notes_ETAB\\d+-\\d+\\.xlsx') %>%
  map_df(~readxl::read_excel(.x) %>% rename_with(~tolower(.)))
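To see which names actually differ before fixing anything, setdiff() on the column names can help. This is a sketch on made-up miniature data frames (the ID column and all values are assumptions; only the SVT/Svt mismatch comes from the answers above):

```r
# Hypothetical miniature versions of two of the imported data frames
epl_data  <- data.frame(ID = 1:2, SVT = c(12, 15))
epl_data3 <- data.frame(ID = 3:4, Svt = c(9, 14))

# Names present in one data frame but missing from the other
only_in_first <- setdiff(names(epl_data), names(epl_data3))
only_in_third <- setdiff(names(epl_data3), names(epl_data))

# After normalising the case, the names agree and rbind() works
names(epl_data)  <- tolower(names(epl_data))
names(epl_data3) <- tolower(names(epl_data3))
eplt <- rbind(epl_data, epl_data3)
```

Lower-casing only repairs differences in case; names that differ in spelling or spacing would still need to be renamed by hand or with janitor::clean_names.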

Change the class of a cell of a data-frame to Date

everyone!
As part of my clinical study I created a xlsx spreadsheet containing a data set. Only columns 2 to 12 and lines 1 to 307 are useful to me. I now manipulate my spreadsheet under R, after importing it (read_excel, etc.).
In my columns 11 and 12 ('data' and 'raw_data'), some cells correspond to dates (for example the first 2 rows of 'data' and 'raw_data'): these are the patient's visit dates. As you can see, these dates are given to me as a number of days since the origin "1899-12-30", and I would like to transform them into a standard date format (2019-07-05).
My problem is that these columns do not contain only dates; they also hold various numerical results (times, means, scores, etc.).
I started by transforming the class of my columns from character to factor/numeric so that I could better manipulate the columns later. But I can't change only the format of cells corresponding to a date.
Do you know if it is possible to transform only the cells concerned and if so how?
I attach my code and a preview of my data frame.
Part "Unsuccessful trial": I tried with this kind of thing. Of course the date changes format here but as soon as I try to make this change in the data frame it doesn't work.
Thank you for your help!
library(readxl)
library(dplyr)
# Indicate the id of the patient
id <- "01_AA"
# Get protocol data of patient
idlst <- dir("/data/protocolData", full.names = TRUE, pattern = id)
# Convert the xlsx database into a data frame
idData <- data.table::rbindlist(lapply(
  idlst,
  read_excel,
  n_max = 307,
  range = cell_cols("B:M") # just keep the table
), fill = TRUE)
idData <- as_tibble(idData)
idData <- idData %>%
  mutate_at(vars(1:10), as.factor) %>%
  mutate_at(vars(11:length(idData)), as.numeric)
# Unsuccessful trial
as.Date.character(data[1:2,11:12], origin ='1899-12-30')
Thank you for your comments and indeed this is one of the problems with R.
I solved my problem with the following code where idData is my df.
# Change the data format of the date cells of the column Data and Raw_data:
idData$Data[grepl("date",idData$Measure)] <- as.character(as.Date(
as.numeric(
idData$Data[grepl("date",idData$Measure)]),
origin = "1899-12-30"))
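The same idea can be tried on a small self-contained example. This is a sketch on invented data: the Measure labels and values are assumptions, and only the column names and the "1899-12-30" origin come from the post. The column stays character because it mixes dates with other results.

```r
# Hypothetical miniature of the table: Measure says what each row holds,
# Data stores the value as text
idData <- data.frame(
  Measure = c("visit_date", "score", "visit_date"),
  Data = c("43651", "12.5", "43652"),
  stringsAsFactors = FALSE
)

# Convert only the rows whose Measure mentions "date"
is_date <- grepl("date", idData$Measure)
idData$Data[is_date] <- format(
  as.Date(as.numeric(idData$Data[is_date]), origin = "1899-12-30")
)
idData
```

Here the serial number 43651 becomes "2019-07-05", while the score row is left untouched.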

Merging files that are formatted differently in Rstudio

For my master thesis I'm trying to merge two files, they contain the following:
one has metrics on my study subjects (seagulls), the other has the reproductive success of these birds.
They are, however, formatted differently: the file with metrics already takes two separate rows for the different phases of the breeding period, thus an individual has multiple rows per year.
The other file with reproductive success does not, there is only one row per individual per year, and the columns belonging to these rows represent the reproductive parameters of each breeding phase.
Now I know I can't just straight up 'merge' the two files in Rstudio, so I wonder how I would go about formatting the files so I can.
I will add pictures to help with interpretation:
First file
Second file
Thank you very much in advance!
You should start by considering WHY you want to merge the files. From what I can see, your files are best kept separate: in the first file you record the same metrics for both phases (the headers are the same), while in the second file you record different metrics for each phase (the headers are different).
As the second file contains different headers for the 2 phases, it is not possible to convert it into the form of the first file. It is, however, possible to convert the first file into the format of the second file, which allows you to combine the two. I strongly caution against this, though, because the combined format makes quick analyses like the following harder:
library(ggplot2)
dat <- read.csv("file1.csv")
# This plots a boxplot comparing the evenness of the 2 phases
ggplot(dat, aes(x = as.factor(period), y = evenness)) + geom_boxplot()
However, if you insist, here is the code to reformat file1 into a single row per entry so it can be combined with file2:
# One more warning, depending on how you
# want to eventually wrangle your data,
# doing this might make your life more difficult
library(dplyr)
f1 <- read.csv("file1.csv", stringsAsFactors = FALSE)
# Split file1 by phase (subset f1 itself, not the earlier `dat` object)
dat1 <- f1[f1$phase == "incubating",]
dat2 <- f1[f1$phase == "chickrearing",]
dat2$phase <- dat1$phase <- NULL
names(dat1) <- c("bird.year", paste0("incubating.", names(dat1)[2:length(names(dat1))]))
names(dat2) <- c("bird.year", paste0("chickrearing.", names(dat2)[2:length(names(dat2))]))
f1.combined <- merge(dat1, dat2, by = "bird.year")
f2 <- read.csv("file2.csv")
f2 <- mutate(f2, bird.year = paste(Individual, year))
combined.files <- merge(f1.combined, f2, by = "bird.year")
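The split-rename-merge steps can also be done with base R's reshape(). The sketch below runs on invented toy data: the column names bird.year, phase and evenness follow the code above, while chicks_fledged and all values are assumptions, since the real files are only shown as images.

```r
# Toy version of file1: two rows (one per phase) for each bird-year
f1 <- data.frame(
  bird.year = c("A 2019", "A 2019", "B 2019", "B 2019"),
  phase = c("incubating", "chickrearing", "incubating", "chickrearing"),
  evenness = c(0.5, 0.6, 0.7, 0.8),
  stringsAsFactors = FALSE
)

# Toy version of file2: one row per bird-year (chicks_fledged is hypothetical)
f2 <- data.frame(
  bird.year = c("A 2019", "B 2019"),
  chicks_fledged = c(2, 0),
  stringsAsFactors = FALSE
)

# Long to wide: one row per bird.year, one evenness column per phase
wide <- reshape(f1, idvar = "bird.year", timevar = "phase", direction = "wide")

# Now both tables have one row per bird.year and can be merged
combined <- merge(wide, f2, by = "bird.year")
combined
```

reshape() names the new columns by appending the phase to the metric name, for example evenness.incubating.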

Matching and sorting data in R or Excel

I have a list of bacteria, each with its own abundance, in a dataframe. I also have the same list of bacteria but in a different order in the same dataframe.
I want to match the abundances to this second list but I'm not sure how to go about doing it.
dplyr contains several methods for sorting data, but I don't know how to match the abundance and print it into a new column so that it matches the second list of bacteria.
Here's the beginning of my dataset:
Taxon Total_abundance Tips
Acaricomes phytoseiuli 0.000382414 Methanothermobacter thermautotrophicus
Acetivibrio cellulolyticus 0.013979274 Methanobacterium beijingense
Acetobacter aceti 0.181150551 Methanobacterium bryantii
Acetobacter estunensis 0.023074895 Methanosarcina mazei
Acetobacter tropicalis 0.014615221 Persephonella marina
Achromobacter piechaudii 0.031811039 Sulfurihydrogenibium azorense
Achromobacter xylosoxidans 0.041558442 Balnearium lithotrophicum
Acidicapsa borealis 0.035525932 Isosphaera pallida
Acidimicrobium ferrooxidans 0.013841209 Simkania negevensis
Acidiphilium angustum 0.041702984 Parachlamydia acanthamoebae
Acidiphilium cryptum 0.039265944 Leptospira biflexa
Acidiphilium rubrum 0.041702984 Leptospira fainei
...
So, the abundance matches the data in Taxon column, and I want the abundance to also be matched with the bacteria in the "Tips" column.
For example, Acaricomes phytoseiuli has an abundance of 0.000382414, so in column D 0.000382414 will be printed next to where Acaricomes phytoseiuli is located. Again, Taxon and Tips contains exactly the same data, just in a different order.
I hope that makes sense.
It doesn't matter if this is done in R or Excel, thanks.
As others have mentioned, it's hard to test without some data that matches, but something like this should work, using match to match up values.
df$D <- df$Total_abundance[ match( df$Tips, df$Taxon ) ]
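A small check on made-up data shows what the match() line does. The taxon names and abundances below are invented; the column names follow the question.

```r
# Toy data: Tips holds the same names as Taxon, in a different order
df <- data.frame(
  Taxon = c("a", "b", "c", "d"),
  Total_abundance = c(0.1, 0.2, 0.3, 0.4),
  Tips = c("c", "a", "d", "b"),
  stringsAsFactors = FALSE
)

# match() returns, for each Tips entry, its row position in Taxon;
# indexing Total_abundance by those positions reorders the abundances
df$D <- df$Total_abundance[match(df$Tips, df$Taxon)]
df
```

Any Tips value without a counterpart in Taxon would get NA in column D.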
I assume that your list of bacteria is unique.
As a sample data frame:
dff <- data.frame(bacteria1=letters[1:10], abundance1=runif(10,0,1),
bacteri2=sample(letters[1:10],10), abundance2=0)
Now we find the matching bacteria rows and insert the abundances:
for(i in 1:nrow(dff)){
s <- which(dff$bacteri2[i]==dff$bacteria1)
dff$abundance2[i] <- dff$abundance1[s]
}
In Excel, under column D, you can do the following:
=VLOOKUP(C3;A3:B13;2;FALSE)
C3 would be the TIP and A3:B13 the range where it searches for this, A being the bacteria name and B the abundance and if found will return the corresponding abundance of the match.
If you get an error like #N/A, then there is no match. You can also avoid these errors by using this formula:
=IFNA(VLOOKUP(C3;$A$3:$B$13;2;FALSE);"No match")
Edit: Adjust the ranges to your file!
Edit 2: Keep in mind that the separator I use is ; and your Excel might use the comma , as separator.
First of all, if your Taxon and Tips columns contain exactly the same data, only in different order, they have no place being together in the same data frame. You should either have two data frames, or come up with some sort of key to define the place of a Taxon item in the phylogenetic tree and then re-sort the data frame as needed, either in alphabetic order or by phylogeny.
As a quick solution, I would first extract the Tips column in a separate data frame, join it with the original data frame by the Tips and Taxon columns, thus obtaining the correct order of abundance values in the new data frame and (if you still insist) using cbind to glue the newly re-sorted abundance column back into the original data frame. Like so, assuming you're using dplyr (df is a dummy stand-in for your data set):
df <- data.frame(Taxon=c("a","b","c","d","e"), Abundance=c(1:5), Tips=c("b","a","d","c","e"))
new_df <- select(df, Tips)
new_df <- left_join(new_df, df, by=c("Tips"= "Taxon"))
df <- cbind(df, New_Abund=new_df$Abundance)
rm(new_df)
