Merging files that are formatted differently in RStudio - r

For my master's thesis I'm trying to merge two files. One has metrics on my study subjects (seagulls); the other has the reproductive success of these birds.
They are, however, formatted differently: the file with metrics uses two separate rows for the different phases of the breeding period, so an individual has multiple rows per year.
The file with reproductive success does not; there is only one row per individual per year, and the columns belonging to these rows represent the reproductive parameters of each breeding phase.
Now I know I can't just straight up 'merge' the two files in RStudio, so I wonder how I should reformat the files so that I can.
I will add pictures to help with interpretation:
First file
Second file
Thank you very much in advance!

You should first start by considering WHY you want to merge the files. From what I can see, your files are best kept separate: in your first file you are recording common metrics across both phases (the headers are the same), while in the second file you are recording different metrics across the two phases (the headers are different).
As the second file contains different headers for the two phases, it is not possible to convert it into the form of the first file. It is, however, possible to convert the first file into the format of the second file, and hence to combine the two files. I strongly caution against this, though, as it could prevent you from doing quick analyses of your data, such as:
library(ggplot2)
dat <- read.csv("file1.csv")
# This plots a boxplot comparing the evenness of the 2 phases
ggplot(dat, aes(x = as.factor(phase), y = evenness)) + geom_boxplot()
However, if you insist, here is the code to reformat file1 into a single row per bird-year so it can be combined with file2:
# One more warning, depending on how you
# want to eventually wrangle your data,
# doing this might make your life more difficult
library(dplyr)
f1 <- read.csv("file1.csv", stringsAsFactors = FALSE)
# Split the rows by breeding phase
dat1 <- f1[f1$phase == "incubating", ]
dat2 <- f1[f1$phase == "chickrearing", ]
dat2$phase <- dat1$phase <- NULL
# Prefix each metric column with the phase it belongs to
names(dat1) <- c("bird.year", paste0("incubating.", names(dat1)[-1]))
names(dat2) <- c("bird.year", paste0("chickrearing.", names(dat2)[-1]))
f1.combined <- merge(dat1, dat2, by = "bird.year")
f2 <- read.csv("file2.csv")
f2 <- mutate(f2, bird.year = paste(Individual, year))
combined.files <- merge(f1.combined, f2, by = "bird.year")
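If you are comfortable with the tidyverse, tidyr::pivot_wider() can do the same reshape in one call. This is only a sketch: it assumes file1 has an identifier column (bird.year, as above) plus the phase column, and that every remaining column is a per-phase metric.
library(tidyr)
f1 <- read.csv("file1.csv", stringsAsFactors = FALSE)
# Spread every metric column out by phase, producing names like "incubating.evenness"
metric.cols <- setdiff(names(f1), c("bird.year", "phase"))
f1.wide <- pivot_wider(f1, id_cols = bird.year, names_from = phase,
                       values_from = all_of(metric.cols),
                       names_glue = "{phase}.{.value}")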

Related

Convert lists to data frames

I'm a beginner in R, working through a data set to expand my knowledge, especially in data manipulation.
The task is to split my data set based on a parameter (a column), then calculate the standard deviation for each group, then produce some graphs. I managed to split my data set into about 3000 list elements, but I'm stuck on converting the lists into separate data sets so I can collect the SD for each one. Or maybe there is an efficient way to do it in one step?
This is what I have done so far:
library(dplyr)  # select() comes from dplyr
xx <- read.table("NetworkRail1.csv", sep = ",", header = TRUE)
selected <- select(xx, ID, Location, Top70m)
splitNR <- split(selected, selected$Location %% 0.125)
If your problem is to save the different components of the list separately, you can try this:
list2env(splitNR, envir=.GlobalEnv)
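That said, you rarely need 3000 separate data frames just to get a standard deviation per group; grouping and summarising in one pass is usually easier. A sketch with dplyr, using the column names from your question:
library(dplyr)
xx %>%
  group_by(grp = Location %% 0.125) %>%
  summarise(sd_Top70m = sd(Top70m, na.rm = TRUE))
Or, keeping your existing split, sapply(splitNR, function(d) sd(d$Top70m, na.rm = TRUE)) returns a named vector with one SD per group.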

Transpose AND Stack only specified rows to columns in R

This question is quite difficult to describe but easy to understand when visualized, so I suggest looking at the two images linked to this post to help facilitate understanding of the issue.
Here is a link to my practice data frame:
sample.data <- read.table("https://pastebin.com/uAQD6nnM", header = TRUE, sep = "\t")
I don't know why I get the error "more columns than column names": using the same file from my desktop works just fine, and clicking on the link does go to my dataset.
I received very large data frames arranged in rows, and I want them arranged in columns. However, it is not that 'easy', because I do not necessarily want (or need) to transpose all the data.
This link appears to be close to what I would like to do, but just not quite the right answer for me Python Pandas: Transpose or Stack?
I have a header with GPS data (Coords_Y, Coords_X), followed by a list of 100+ plant species names. If a species is present at a certain location, the author used the term TRUE, and if not present, they used the term FALSE.
I would like to take the data set I've been sent, create a new column called "species" that stacks the species listed in rows on top of each other, and keep only the rows set to TRUE. Therefore, as my images show, if 2 plants are both present at the same location, the GPS points need to be duplicated so that no data point is lost; and if a certain species is present at many locations, the species name needs to be repeated multiple times in the column. In the end, I will have a data set that is thousands of rows long but only 5 columns wide.
Before
After
Here is a way to do it using base R:
# Notice that the link works if you include the /raw/ part
sample.data <- read.table("https://pastebin.com/raw/uAQD6nnM", header = TRUE, sep = "\t")
vars <- c("var0", "Var.1", "Coords_y", "Coords_x")
# Just selects the ones marked TRUE for each
alf <- sample.data[ sample.data$Alfaroa.williamsii, vars ]
aln <- sample.data[ sample.data$Alnus.acuminata, vars ]
alf$species <- "Alfaroa.williamsii"
aln$species <- "Alnus.acuminata"
final <- rbind(alf,aln)
final
var0 Var.1  Coords_y  Coords_x            species
 192   191   7.10000 -73.00000 Alfaroa.williamsii
 101   100 -13.18000 -71.59000 Alfaroa.williamsii
  36    35  10.18234 -84.10683    Alnus.acuminata
  38    37  10.26787 -84.05528    Alnus.acuminata
To do it more generally, using dplyr and tidyr, you can use the gather function:
library(dplyr)
library(tidyr)
tidyr::gather(sample.data, key = "species", value = "keep", 5:6) %>%
  dplyr::filter(keep) %>%
  dplyr::select(-keep)
Just replace the 5:6 with the indices of the columns of different species.
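Note that gather() has since been superseded by pivot_longer(). A sketch of the equivalent call, under the same assumption that columns 5 and 6 hold the species flags:
library(dplyr)
library(tidyr)
sample.data %>%
  pivot_longer(cols = 5:6, names_to = "species", values_to = "keep") %>%
  filter(keep) %>%
  select(-keep)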
I could not download the data, so I made some:
sample.data <- data.frame(var0 = c(192, 36, 38, 101), var1 = c(191, 35, 37, 100),
                          y = c(7.1, 10.1, 10.2, -13.8), x = c(-73, -84, -84, -71),
                          Alfaroa = c(T, F, F, T), Alnus = c(T, T, T, F))
The code that gives the requested result is:
library(dplyr)
dfAlfaroa <- sample.data %>% filter(Alfaroa) %>%
  select(-Alfaroa, -Alnus) %>% mutate(Species = "Alfaroa")
dfAlnus <- sample.data %>% filter(Alnus) %>%
  select(-Alfaroa, -Alnus) %>% mutate(Species = "Alnus")
rbind(dfAlfaroa, dfAlnus)

How to load part of data set efficiently based on time index in R

I often work with .csv files larger than 100 MB; however, in many cases I only need a subset of the data, and that subset always lies within a certain time interval. My question is whether there is a function or way in R to load only that subset of the data, without knowing the row indices of the timestamps.
This is how I usually do it:
Imagine I have a .csv file called Large_CSV_file.csv with 20 years of minute timestamps in the first column and data in the second.
# Load entire data set
dat <- read.csv("Large_CSV_file.csv")
# Find the indices where to cut the data set
# (as.Date() needs an explicit format for "dd-mm-yyyy" strings)
start.idx <- which(dat[, 1] == as.Date("01-01-1990", format = "%d-%m-%Y"))
end.idx <- which(dat[, 1] == as.Date("01-01-1991", format = "%d-%m-%Y"))
# Subset of data set
dat <- dat[start.idx:end.idx, ]
The first line of code is the one that takes a long time to run, and it feels inefficient to load the entire data set only to discard 99% of it...
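One way to avoid loading everything is readr::read_csv_chunked(), which applies a filter to each chunk as it is read, so only the matching rows are ever held in memory. A sketch, assuming the first column is named timestamp and parses as a date-time:
library(readr)
library(dplyr)
# The callback keeps only the 1990 rows of each chunk
keep_1990 <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, timestamp >= as.POSIXct("1990-01-01"),
                timestamp < as.POSIXct("1991-01-01"))
})
dat <- read_csv_chunked("Large_CSV_file.csv", keep_1990, chunk_size = 100000)
Alternatively, data.table::fread() still reads the whole file, but typically far faster than read.csv().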

Differences in imported data from one file vs. lots of files

I have built a function which allows me to process .csv files one by one. It imports data using read.csv, assigns one of the columns a name, and makes a series of calculations based on that one column. However, I'm having trouble applying this function to a whole folder of files. Once a list of files is generated, do I need to read the data from each file within my function, or before applying it? This is what I had previously to import the data:
AllData <- read.csv("filename.csv", header = TRUE, skip = 7)
DataForCalcs <- AllData[5]
My code resulted in the calculation of a number of variables, which I put into a matrix at the end of the code, and used the apply function to calculate the max of each of those variables.
NewVariables <- matrix(c(Variable1, Variable2, Variable3, Variable4, Variable5), ncol = 5)
colnames(NewVariables) <- c("Variable1", "Variable2", "Variable3", "Variable4", "Variable5")
apply(NewVariables, 2, max, na.rm = TRUE)
This worked great, but I then need to write this table to a new .csv file, which contains these results for each of the ~300 files I want to process, preceded by the name of each file. I'm new to this, so I would really appreciate your time helping me out!
Have you thought about reading in all your .csv files in a loop that combines them into one dataframe? I do this all the time, like this:
df <- c()
for (x in list.files(pattern = "*.csv")) {
  u <- read.csv(x, skip = 6)
  u$Label <- factor(x)  # a column holding the filename
  df <- rbind(df, u)
}
This of course assumes that every .csv file has an equal number of columns that are named the same thing. But if that assumption is true then you can simply treat the resulting dataframe like one dataframe.
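For reference, a slightly more idiomatic sketch of the same idea reads the files into a list and binds once at the end, which avoids growing df inside the loop:
files <- list.files(pattern = "*.csv")
# transform() appends the Label column to each data frame as it is read
dfs <- lapply(files, function(x) transform(read.csv(x, skip = 6), Label = factor(x)))
df <- do.call(rbind, dfs)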
Once you have your dataframe built, you can use the Label column as your group-by variable. You'll also need to select only the 5th and 13th variables as well as the Label variable. Then, if your goal is to take, say, the max value for each .csv file and produce another dataframe of those max values, you'd go about it like this:
library(dplyr)
df.summary <- df %>%
  group_by(Label) %>%
  summarise_all(max)  # take the max value of each column except Label
There are better ways to do this using gather() but I don't want to overwhelm you.
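Since you also need the results in a new .csv file, one more line writes the summary out (the filename here is just an example):
write.csv(df.summary, "summary_by_file.csv", row.names = FALSE)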

How to join data from 2 different csv-files in R?

I have the following problem: in one csv-file I have a column for species, one for transect, one for year and one for AUC. In another csv-file I have a column for transect, one for year, one for precipitation and one for temperature. Now I would like to join the files in R so that I get the species and AUC columns from the first csv and the remaining columns from the second csv.
In the end I'd like to get a file with transect_id, year, day, month, species, regional_gam (= AUC), precipitation and LST (= temperature).
So basically the precipitation/LST values from TR001 for every day in 2008 need to be assigned to every species which has an AUC value for 2008 and TR001.
Thanks!
Use read.csv and then merge.
Load the two csv files into R (don't forget to make sure their common variables share the same names!):
df1 <- read.csv(dat1, header = TRUE)
df2 <- read.csv(dat2, header = TRUE)
Merge the dataframes together by their shared variables, adding the argument all.x = TRUE so that all rows are kept from the data frame containing species (by default, merge() keeps only the matching rows):
merge(df1, df2, by = c('transect_id', 'year'), all.x = TRUE)
To see this in action using test data:
test <- data.frame(sp = rep(letters[1:10], 2), t = rep(1:3, 2, 20),
                   y = rep(2000:2008, len = 20), AUC = 1:20)
test2 <- data.frame(t = rep(1:3, 2, 9), y = rep(2000:2008, len = 9),
                    ppt = 1:9, temp = 11:19)
merge(test, test2, by = c('t', 'y'), all.x = TRUE)
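For what it's worth, the dplyr equivalent of that left join is a one-liner:
library(dplyr)
left_join(test, test2, by = c("t", "y"))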
Please use:
df1 <- read.csv("F:\\Test_Anything\\Merge\\1.csv", header = TRUE)
df2 <- read.csv("F:\\Test_Anything\\Merge\\2.csv", header = TRUE)
r <- merge(df1, df2, by = "NAME", all.x = TRUE)
write.csv(r, "F:\\Test_Anything\\Merge\\DF.csv", row.names = FALSE)
In general, to stack .csv files that share the same columns (appending rows rather than joining by a key), you can simply use this code snippet:
path <- rstudioapi::getActiveDocumentContext()$path
Encoding(path) <- "UTF-8"
setwd(dirname(path))
datap1 <- read.csv("file1.csv", header = TRUE)
datap2 <- read.csv("file2.csv", header = TRUE)
data <- rbind(datap1, datap2)
write.csv(data, "merged.csv")
Note: First 3 lines of code set the working directory to where the R file is located and are not related to the question.
