How to join data from two different CSV files in R?

I have the following problem: in one CSV file I have a column for species, one for transect, one for year and one for AUC. In another CSV file I have a column for transect, one for year, one for precipitation and one for temperature. Now I would like to join the files in R so that I get the species and AUC columns from the first CSV alongside the remaining columns from the second CSV.
In the end I'd like to get a file with transect_id, year, day, month, species, regional_gam (= AUC), precipitation and LST (= temperature).
So basically the precipitation/LST values from TR001 for every day in 2008 need to be assigned to every species that has an AUC value for 2008 and TR001.
Thanks!

Use read.csv and then merge.
Load the two CSV files into R (don't forget to make sure their common variables share the same name!):
df1 <- read.csv(dat1, header = TRUE)
df2 <- read.csv(dat2, header = TRUE)
Merge the data frames together by their shared variables and add the argument all.x = TRUE (not the default, which is FALSE) to ensure all rows are kept from the data frame containing species:
merge(df1, df2, by = c('transect_id', 'year'), all.x = TRUE)
To see this in action using test data:
test <- data.frame(sp = rep(letters[1:10], 2), t = rep(1:3, length.out = 20), y = rep(2000:2008, length.out = 20), AUC = 1:20)
test2 <- data.frame(t = rep(1:3, length.out = 9), y = rep(2000:2008, length.out = 9), ppt = 1:9, temp = 11:19)
merge(test, test2, by = c('t', 'y'), all.x = TRUE)
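For the same result with dplyr, a minimal sketch (assuming both files share key columns named transect_id and year, and with hypothetical file names):
library(dplyr)
df1 <- read.csv("species_auc.csv")  # hypothetical file names
df2 <- read.csv("climate.csv")
# left_join keeps every row of df1 (the species/AUC table),
# attaching the matching climate values by transect and year
joined <- left_join(df1, df2, by = c("transect_id", "year"))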

Please use
df1 <- read.csv("F:\\Test_Anything\\Merge\\1.csv", header = TRUE)
df2 <- read.csv("F:\\Test_Anything\\Merge\\2.csv", header = TRUE)
r <- merge(df1, df2, by = 'NAME', all.x = TRUE)
write.csv(r, "F:\\Test_Anything\\Merge\\DF.csv", row.names = FALSE)

In general, to combine .csv files that share the same columns (stacking one below the other, rather than joining on a key), you can simply use this code snippet:
path <- rstudioapi::getActiveDocumentContext()$path
Encoding(path) <- "UTF-8"
setwd(dirname(path))
datap1 <- read.csv("file1.csv", header = TRUE)
datap2 <- read.csv("file2.csv", header = TRUE)
data <- rbind(datap1, datap2)
write.csv(data, "merged.csv", row.names = FALSE)
Note: the first 3 lines of code only set the working directory to where the R file is located and are not related to the question.
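Note that rbind() only stacks rows; if the two files instead need to be combined column-wise on a shared key (as in the question at the top), merge() is the right tool. A minimal sketch, assuming a shared key column named id (hypothetical):
datap1 <- read.csv("file1.csv", header = TRUE)
datap2 <- read.csv("file2.csv", header = TRUE)
# Rows are matched wherever the 'id' values agree
merged <- merge(datap1, datap2, by = "id", all.x = TRUE)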

Related

Merging CSV files of the same names from different folders into one file

I have 14 years of precipitation data for different meteo stations (more than 400) in the structure as follows for years 2008-2021:
/2008/Meteo_2008-01/249180090.csv
/2008/Meteo_2008-02/249180090.csv
/2008/Meteo_2008-03/249180090.csv ... and so on for the rest of the months.
/2009/Meteo_2009-01/249180090.csv
/2009/Meteo_2009-02/249180090.csv
/2009/Meteo_2009-03/249180090.csv ... and so on for the rest of the months.
I have a structure like that until 2021. 249180090.csv stands for the station code; as I wrote above, I have more than 400 stations.
Each CSV file contains daily precipitation data for the given rainfall station.
I would like to create one CSV file for EVERY STATION for every year from 2008 to 2021, containing the merged precipitation information from January until December. The name of the CSV file should contain the station number.
Would someone be kind enough to help me do that in a loop? My goal is not to create just one file out of all the data, but a separate CSV file for every meteo station. On the forum I found a question solving a relatively similar problem, but it merged all the data into just one file, without subdivision into separate files.
The problem can be split into parts:
Identify all files in all subfolders of the working directory by using list.files(..., recursive = TRUE).
Keep only the csv files.
Import them all into R, for example by mapping read.csv over all paths.
Join everything into a single dataframe, for example with reduce and bind_rows (assuming that all csvs have the same structure).
Split this single dataframe by station code, for example with group_split().
Write the split dataframes to csv, for example by mapping write.csv.
This way you can avoid using for loops.
library(here)
library(stringr)
library(purrr)
library(dplyr)
# Identify all files
filenames <- list.files(here(), recursive = TRUE, full.names = TRUE)
# Limit to csv files (escape the dot and anchor at the end of the name)
joined <- filenames[str_detect(filenames, "\\.csv$")] |>
  # Read them all in
  map(read.csv) |>
  # Join them together
  reduce(bind_rows)
# Split into one dataframe per station
split_df <- joined |> group_split(station_code)
# Save each dataframe to a separate csv, named after the station code
# (assumes the station code is stored in the first column)
map(seq_along(split_df), function(i) {
  write.csv(split_df[[i]],
            paste0(split_df[[i]][1, 1], "_combined.csv"),
            row.names = FALSE)
})
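Since the question says the station code only appears in the file name, the CSVs themselves may carry no station_code column. A variant (a sketch, building on the pipeline above) that derives the code from the path:
csv_paths <- filenames[str_detect(filenames, "\\.csv$")]
joined <- csv_paths |>
  # Name each path by its station code, i.e. the file name without extension
  set_names(tools::file_path_sans_ext(basename(csv_paths))) |>
  map(read.csv) |>
  # .id turns the list names into a station_code column
  bind_rows(.id = "station_code")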

Appending two excel files into one dataframe

I am trying to append two Excel files into one dataframe in R.
I am using the following code to do so:
rm(list = ls(all.names = TRUE))
library(rio) #this is for the excel appending
library("dplyr") #this is required for filtering and selecting
library(tidyverse)
library(openxlsx)
path1 <- "A:/Users/Desktop/Test1.xlsx"
path2 <- "A:/Users/Desktop/Test2.xlsx"
dat = bind_rows(path1,path2)
Output
> dat = bind_rows(path1,path2)
Error: Argument 1 must have names.
Run `rlang::last_error()` to see where the error occurred
I appreciate that this is more for combining rows together, but can someone help me with combining different workbooks into one dataframe in RStudio?
bind_rows() works with data frames AFTER they have been loaded into the R environment. Here you are merely trying to "bind" two character strings together, hence the error. First you need to import the data from Excel, which you could do with something like:
test_df1 <- readxl::read_xlsx(path1)
test_df2 <- readxl::read_xlsx(path2)
and then you should be able to run:
test_df <- bind_rows(test_df1, test_df2)
A quicker way would be to iterate the process using the map_df function from purrr:
test_df <- map_df(c(path1, path2), readxl::read_xlsx)
If you want to append one under the other, meaning that both Excel files have the same columns, I would:
select the rows I want from the first Excel file and create a dataframe,
select the rows from the second Excel file and create a second dataframe,
append them with rbind(), as sketched below.
On the other hand, if you want to append one next to the other, I would choose the columns needed from the first and second Excel files into two dataframes respectively and then go with cbind().
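A minimal sketch of both cases, assuming the two files have already been imported as above:
df1 <- readxl::read_xlsx(path1)
df2 <- readxl::read_xlsx(path2)
# Same columns in both: stack one under the other
stacked <- rbind(df1, df2)
# Same number of rows in both: place side by side
side_by_side <- cbind(df1, df2)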

How to import and create time series data frames in an efficient way?

I have many separate data files in CSV format for a lot of daily stock prices. Over a few years there are hundreds of those data files, whose names are the dates on which the data were recorded.
In each file there are variables of ticker (or stock trading code), date, open price, high price, low price, close price, and trading volume. For example, inside a data file named 20150128.txt it looks like this:
FB,20150128,1.075,1.075,0.97,0.97,725221
AAPL,20150128,2.24,2.24,2.2,2.24,63682
AMZN,20150128,0.4,0.415,0.4,0.415,194900
NFLX,20150128,50.19,50.21,50.19,50.19,761845
GOOGL,20150128,1.62,1.645,1.59,1.63,684835
...................and many more..................
In case it's relevant, the number of stocks in these files is not necessarily the same (so there will be missing data). I need to import the files and create 5 separate time series data frames, one each for Open, High, Low, Close and Volume. In each data frame, rows are indexed by date and columns by ticker. For example, the data frame Open may look like this:
DATE,FB,AAPL,AMZN,NFLX,GOOGL,...
20150128,1.5,2.2,0.4,5.1,1.6,...
20150129,NA,2.3,0.5,5.2,1.7,...
...
What would be an efficient way to do that? I've used the following code to read the files into a list of data frames but don't know where to go from here.
files = list.files(pattern = "*.txt")
mydata = lapply(files, read.csv, header = FALSE)
EDIT: I managed to get this to work (not sure about the downvote though):
library(plyr); library(readr)
files = list.files(pattern = "*.txt")
df = ldply(files, read_csv, col_names = c("ticker", "date", "open", "high", "low", "close", "volume"))
This might be more convenient using purrr and readr:
library(tidyverse)
mydata <- files %>%
  map_dfr(read_csv,
          col_names = c("ticker", "date", "open", "high", "low", "close", "volume"))
This way, you'll get a single tibble rather than a list of data frames.
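That single tibble is still in long format; to get the five wide data frames the question asks for (rows indexed by date, columns by ticker), one option is tidyr::pivot_wider(). A sketch, assuming the column names used above:
# Build the wide Open data frame; missing ticker/date pairs become NA
open_df <- mydata %>%
  select(date, ticker, open) %>%
  pivot_wider(names_from = ticker, values_from = open)
# Repeat the same pattern for high, low, close and volume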

Differences in imported data from one file vs. lots of files

I have built a function which allows me to process .csv files one by one. This involves importing the data using the read.csv function, assigning one of the columns a name, and making a series of calculations based on that one column. However, I'm having problems with how to apply this function to a whole folder of files. Once a list of files is generated, do I need to read the data from each file from within my function, or prior to applying it? This is what I had previously to import the data:
AllData <- read.csv("filename.csv", header=TRUE, skip=7)
DataForCalcs <- AllData[5]
My code resulted in the calculation of a number of variables, which I put into a matrix at the end of the code, and I used the apply function to calculate the max of each of those variables.
NewVariables <- matrix(c(Variable1, Variable2, Variable3, Variable4, Variable5), ncol = 5)
colnames(NewVariables) <- c("Variable1", "Variable2", "Variable3", "Variable4", "Variable5")
apply(NewVariables, 2, max, na.rm=TRUE)
This worked great, but I then need to write this table to a new .csv file, which contains these results for each of the ~300 files I want to process, preceded by the name of each file. I'm new to this, so I would really appreciate your time helping me out!
Have you thought about reading in all your .csv files in a loop that combines them into one dataframe? I do this all the time like this:
df <- c()
for (x in list.files(pattern = "*.csv")) {
  u <- read.csv(x, skip = 6)
  u$Label <- factor(x)  # a column holding the file name
  df <- rbind(df, u)
}
This of course assumes that every .csv file has an equal number of columns that are named the same thing. But if that assumption is true then you can simply treat the resulting dataframe like one dataframe.
Once you have your dataframe assembled, you can use the Label column as your grouping variable. You'll also need to select only the 5th and 13th variables as well as the Label variable. Then, if your goal is to take, say, the max value of each column for each .csv file and produce another dataframe of those max values, you'd go about it like this:
library(dplyr)
df.summary <- df %>%
  group_by(Label) %>%
  summarise(across(everything(), max))  # take the max of each column except Label
There are better ways to do this using gather() but I don't want to overwhelm you.
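Since the goal also includes writing these per-file results to a new .csv, one final line does it (the output file name here is hypothetical); Label already carries the name of each input file:
write.csv(df.summary, "per_file_max_values.csv", row.names = FALSE)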

read multiple csv files with the same column headings and find the mean

Is it possible to read multiple CSV files into R? All of the CSV files have the same 4 columns: the first is character, the second and third are numeric and the fourth is integer. I want to combine the data in each numeric column and find the mean.
I can get the csv files into R with
data <- list.files(directory)
myFiles <- paste(directory,data[id],sep="/")
I am unable to get the numbers from the individual columns in order to add them up and find the mean.
I am completely new to R and any advice is appreciated.
Here is a simple method:
Prep: Generate dummy data: (You already have this)
dummy <- data.frame(names=rep("a",4), a=1:4,b=5:8)
write.csv(dummy,file="data01.csv",row.names=F)
write.csv(dummy,file="data02.csv",row.names=F)
write.csv(dummy,file="data03.csv",row.names=F)
Step0: Load the file names: (just like you are doing)
data <- dir(getwd(), pattern = "\\.csv$")
Step1: Read and combine:
DF <- do.call(rbind, lapply(data, function(fn) read.csv(file = fn, header = TRUE)))
DF
Step2: Find mean of appropriate columns:
apply(DF[,2:3],2,mean)
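An equivalent shortcut for Step 2 is colMeans():
# Column-wise means of the two numeric columns
colMeans(DF[, 2:3])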
Hope that helps!!
EDIT: If you are having trouble with file path, try ?file.path.
