Comparing two dataframes if a column from both has a common entry - r

I have two CSV files and I am using R:
https://drive.google.com/open?id=1CSLDs9qQXPMqMegdsWK2cQI_64B9org7
https://drive.google.com/open?id=1mVp1s0m4OZNNctVBn5JXIYK1JPsp-aiw
As is visible from the files, each file has a list of dates running from 2008 to the present along with other columns.
I want my output to be two files, but both should contain rows of data for the dates present in both files.
For example, if date X is missing from one file, it should also be removed from the other file where it is present. Only dates present in both files, and their corresponding rows, should survive in both output files.
I tried the inner_join function in the dplyr library but that didn't work because the dates are in factor format.

You can avoid the conversion of character strings to factors by adding stringsAsFactors = FALSE. In addition, in your dataset NA is coded as the string null, so you should also specify this in the call to read.csv:
library(dplyr)
path1 <- "the path for the first dataset KS"
path2 <- "the path for the second dataset 105560.KS"
df1 <- read.csv(path1, stringsAsFactors = FALSE)
df2 <- read.csv(path2, stringsAsFactors = FALSE, na.strings = "null")
df_comb <- inner_join(df1, df2, by = "Date")
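Since the question asks for two output files rather than one joined table, one possible follow-up is to filter each data frame down to the dates it shares with the other. Below is a minimal sketch in base R. The two small data frames stand in for the CSV files, and the column names Close and Volume as well as the output file names are made up for illustration.

```r
# Toy stand-ins for the two CSV files; real data would come from read.csv().
df1 <- data.frame(Date = c("2008-01-02", "2008-01-03", "2008-01-04"),
                  Close = c(10.1, 10.4, 10.2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date = c("2008-01-03", "2008-01-04", "2008-01-07"),
                  Volume = c(500, 650, 700),
                  stringsAsFactors = FALSE)

# Dates that occur in both data frames.
common_dates <- intersect(df1$Date, df2$Date)

# Keep only the rows whose date is shared; each frame keeps its own columns.
df1_common <- df1[df1$Date %in% common_dates, ]
df2_common <- df2[df2$Date %in% common_dates, ]

# Write one output file per input (file names are placeholders).
out1 <- file.path(tempdir(), "file1_common.csv")
out2 <- file.path(tempdir(), "file2_common.csv")
write.csv(df1_common, out1, row.names = FALSE)
write.csv(df2_common, out2, row.names = FALSE)
```

With dplyr loaded, semi_join(df1, df2, by = "Date") and semi_join(df2, df1, by = "Date") should give the same two filtered data frames.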

Related

Combining a list of data frames into a new data frame in R

This is a 3rd edit to the question (leaving below thread just in case):
The following code makes some sample data frames, selects those with "_areaX" in the title and makes a list of them. The goal is then to combine the data frames in the list into 1 data frame. It almost works...
Area1 <- 100
Area2 <- 200
Area3 <- 300
Zone <- 3
a1_areaX <- data.frame(Area1)
a2_areaX <- data.frame(Area2)
a3_areaX <- data.frame(Area3)
a_zoneX <- data.frame(Zone)
library(dplyr)
pattern = "_areaX"
df_list <- mget(ls(envir = globalenv(), pattern = pattern))
big_data = bind_rows(df_list, .id = "FileName")
The problem is the newly created data frame looks like this:
And I need it to look like this:
File Name    Area measurement
a1_areaX     100
a2_areaX     200
a3_areaX     300
Below are my earlier attempts at asking this question. Edited from first version:
I have csv files imported into R Global Env that look like this (I'd share the actual file(s) but there doesn't seem to be a way to do this here):
They all have a name, the above one is called "s6_section_area". There are many of them (with different names) and I've put them all together into a list using this code:
pattern = "section_area"
section_area_list <- list(mget(grep(pattern,ls(globalenv()), value = TRUE), globalenv()))
Now I want a new data frame that looks like this, put together from the data frames in the above made list.
File Name          Area measurement
a1_section_area    a number
a2_section_area    another number
many more          more numbers
So, the first column should list the name of the original file and the second column the measurement provided in that file.
Hope this is clearer - Not sure how else to provide reproducible example without sharing the actual files (which doesn't seem to be an option).
addition to edit: Using this code
section_area_data <- bind_rows(section_area_list, .id = "FileName")
I get (it goes on and on to the right)
I'm after a table that looks like the sample above, left column is file name with a list of file names going down. Right column is the measurement for that file name (taken from original file).
Note that in your list of dataframes (df_list) all the columns have different names (Area1, Area2, Area3) whereas in your output dataframe they all have been combined into one single column. So for that you need to change the different column names to the same one and bind the dataframes together.
library(dplyr)
library(purrr)
result <- map_df(df_list, ~ .x %>%
                   rename_with(~ "Area", contains("Area")),
                 .id = "FileName")
result
#   FileName Area
# 1 a1_areaX  100
# 2 a2_areaX  200
# 3 a3_areaX  300
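The same idea (give every piece the same column name, then bind) can also be sketched without dplyr or purrr. This is only an illustration in base R. It rebuilds the sample list from the question and assumes that the value of interest is always in the first column of each data frame.

```r
# Sample list matching the question: each data frame has a differently named column.
df_list <- list(
  a1_areaX = data.frame(Area1 = 100),
  a2_areaX = data.frame(Area2 = 200),
  a3_areaX = data.frame(Area3 = 300)
)

# Build one two-column piece per list element, taking the first column as the value.
rows <- lapply(names(df_list), function(nm) {
  data.frame(FileName = nm, Area = df_list[[nm]][[1]], stringsAsFactors = FALSE)
})

# Stack the pieces; they now all share the same column names.
result_base <- do.call(rbind, rows)
```

Because every piece already has the columns FileName and Area, rbind() has nothing to fill in and the result stays two columns wide.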
Thanks everyone for your suggestions. In the end, I was able to combine the suggestions and some more thinking and came up with this, which works perfectly.
library("dplyr")
pattern = "section_area"
section_area_list <- mget(ls(envir = globalenv(), pattern = pattern))
section_area_data <- bind_rows(section_area_list, .id = "FileName") %>%
select(-V1)
So, a bunch of csv files were imported into the R Global Env. A list of all files with a name ending in "section_area" was made. Those files were then bound into one big data frame, with the file names in one column and the value (the area measurement in this case) in the other column (there was a pointless column called "V1" in the original csv files, which I deleted).
This is what one of the many csv files looks like: [image: sample csv file]
And this is the layout of the final data frame (it goes on for about 150 rows): [image: final data frame]

Identifying difference among column names within a list of dataframes and changing them to have similar column names for all dataframes

In my working directory I have 100 csv files saved in a data folder. Each of these files has 200 variables, and they are all supposed to have the same variable names. I tried to check whether the column names of these csv files are the same, as follows:
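library(fs)     # dir_ls()
library(purrr)  # map_chr()
library(readr)  # read_lines()
dt_paths <- dir_ls(path = "data", glob = "*.csv")
map_chr(dt_paths, ~ read_lines(.x, n_max = 1))           # header line of each file
unname(map_chr(dt_paths, ~ read_lines(.x, n_max = 1)))
table(unname(map_chr(dt_paths, ~ read_lines(.x, n_max = 1))))  # count of each distinct header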
I find that not all the data files have the same column names (96 share the same column names and 4 have different ones).
My objectives are to:
1. identify which files have different column names, and
2. change the column names of those 4 files to match those of the other 96.
Is there a handy way of doing that?
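One way the two objectives could be approached is sketched below in base R. It is not a tested solution for the actual 100 files. It creates three small csv files in a temporary folder (file names and columns are invented) and assumes that the odd files have the same columns in the same order, so that only the labels differ, and that no header contains a quoted comma.

```r
# Create three small csv files in a temporary folder; one has a deviating header.
data_dir <- file.path(tempdir(), "header_check")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("id,height,weight", "1,170,65"), file.path(data_dir, "a.csv"))
writeLines(c("id,height,weight", "2,180,80"), file.path(data_dir, "b.csv"))
writeLines(c("ID,Height_cm,Weight_kg", "3,160,55"), file.path(data_dir, "c.csv"))

paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Read only the header line of each file.
headers <- vapply(paths, function(p) readLines(p, n = 1), character(1))

# Treat the most frequent header as the reference.
reference <- names(which.max(table(headers)))

# 1. Files whose header differs from the reference.
odd_files <- paths[headers != reference]

# 2. Read every file and overwrite the column names of the odd ones.
#    Assumes same column order; only the labels differ.
ref_names <- strsplit(reference, ",", fixed = TRUE)[[1]]
data_list <- lapply(paths, function(p) {
  d <- read.csv(p, stringsAsFactors = FALSE)
  if (p %in% odd_files) names(d) <- ref_names
  d
})
names(data_list) <- basename(paths)
```

Here odd_files answers the first objective, and data_list holds all files with harmonised column names. If the odd files had their columns in a different order, renaming by position like this would mislabel them, so it is worth inspecting the deviating headers before overwriting anything.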

Import multiple .txt files and merging them

There are around 3k .txt files, comma separated with equal structure and no col names.
e.g. 08/15/2018,11.84,11.84,11.74,11.743,27407 ///
I only need column 1 (the date) and column 5 (11.743 above), and would like to import all those vectors with the name of the .txt file assigned (AAAU.txt -> AAAU vector). In a second step I would like to merge them into a matrix, with all the possible dates as rows and one column per .txt file name holding the column 5 value for each date.
I tried using readr, but I was unable to include the information of the filename, thus I cannot proceed.
Cheers for any help!
I didn't test this code, but I think this will work for you. You can use list.files() to pull all file names into a variable, then read each one individually and append it to a new data frame with either rbind() or cbind():
setwd("C:/your_favorite_directory/")
fnames <- list.files(pattern = "\\.txt$")
# the files have no header row, so read.csv names the columns V1, V2, ...
csv <- lapply(fnames, read.csv, header = FALSE)
result <- do.call(rbind, csv)
# grab a subset of the fields you need
df <- subset(result, select = c(V1, V5))
# then write your final file
write.table(df, "AllFiles.txt", sep = ",")
Also, a '-' sign in select drops variables. Make sure the variable names are NOT quoted when using the subset() function.
df = subset(mydata, select = -c(b,c,d) )
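The rbind() approach stacks all files on top of each other and loses the file names, whereas the question asks for one column per file. A rough sketch of that wide layout in base R follows. The two files are written to a temporary folder purely for illustration. BBBB.txt and every value except the AAAU line quoted in the question are made up.

```r
# Create two small headerless files in a temporary folder.
txt_dir <- file.path(tempdir(), "txt_import")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("08/15/2018,11.84,11.84,11.74,11.743,27407",
             "08/16/2018,11.75,11.80,11.70,11.790,30112"),
           file.path(txt_dir, "AAAU.txt"))
writeLines("08/16/2018,20.10,20.30,20.00,20.250,1500",
           file.path(txt_dir, "BBBB.txt"))

files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# Read columns 1 and 5 of each file and name the value column after the file.
read_one <- function(path) {
  d <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)[, c(1, 5)]
  names(d) <- c("Date", tools::file_path_sans_ext(basename(path)))
  d
}
series <- lapply(files, read_one)

# Full outer join on Date so every date from any file gets a row.
wide <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), series)

# Sort chronologically; the dates are in month/day/year format.
wide <- wide[order(as.Date(wide$Date, format = "%m/%d/%Y")), ]
```

Dates missing from a file show up as NA in that file's column. Merging pairwise with Reduce() is simple but may be slow for around 3k files, so it is probably worth trying on a handful of files first.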

Assigning column names in R with another dataset

I'm trying to read in two data files: one is the actual data, the other is a file of column names in rows. I then need to assign the column names to the actual data. Below is what I have, but it's not assigning them properly.
#read in the data
glass_data = read.csv('/all_datasets/glass/glass.txt', header=FALSE)
glass_headers = read.csv('/all_datasets/glass/header.txt')
#add the names
names(glass_data) = c(glass_headers)
Would this work:
colnames(glass_data) <- glass_headers[, 1]
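One thing that may also matter here: read.csv() defaults to header = TRUE, so reading the header file that way would use its first line as a column name instead of as data, leaving one name short. A small sketch of the whole round trip, using temporary stand-in files with invented contents instead of the real glass.txt and header.txt:

```r
# Toy stand-ins for glass.txt (data without header) and header.txt (one name per row).
data_file <- tempfile(fileext = ".txt")
header_file <- tempfile(fileext = ".txt")
writeLines(c("1,1.52101,4.49", "2,1.51761,3.60"), data_file)
writeLines(c("Id", "RI", "Mg"), header_file)

glass_data <- read.csv(data_file, header = FALSE)

# header = FALSE keeps the first name from being consumed as a column header.
glass_headers <- read.csv(header_file, header = FALSE, stringsAsFactors = FALSE)

# There must be exactly one name per data column.
stopifnot(nrow(glass_headers) == ncol(glass_data))

# Take the first column as a plain character vector, not as a data frame.
names(glass_data) <- glass_headers[[1]]
```

The difference from names(glass_data) = c(glass_headers) is that glass_headers[[1]] is a character vector of names, whereas c() applied to a data frame returns a list with one element per column.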

Iterating through unknown dates in R

I'm new to R (having worked in C++ and Python before) so this is probably just a factor of me not knowing some of R's nuances.
The program I'm working on is supposed to construct matrices of data by date. Here's how I might initialize such a matrix:
dates <- seq(as.Date("1980-01-01"), as.Date("2013-12-31"), by="days")
HN3 <- matrix(nrow=length(dates), ncol = 5, dimnames = list(as.character(dates), c("Value1", "Value2", "Value3", "Value4", "Value5")))
Notice that dates includes every day between 1980 and 2013.
So, from there, I have files containing certain dates and measurements of Value1, etc for those dates, and I need to read those files' contents into HN3. But the problem is that most of the files don't contain measurements for every day.
So what I want to do is read a file into a dataframe (say, v1read) with column 1 being dates and column 2 being the desired data. Then I'd match the dates of v1read to that date's row in HN3 and copy all of the relevant v1read values that way. Here is my attempt at doing so:
for (i in 1:nrow(v1read)) {
HN3[as.character(v1read[i,1]),Value1] <- v1read[i,4]
}
This gives me an out of index range error when the value of i is bumped up unexpectedly. I understand that R doesn't like to iterate through dates, but since the iterator itself is a numeric value rather than a date, I was hoping I'd found a loophole.
Any tips on how to accomplish this would be enormously appreciated.
Let's use library(dplyr). Start with
dates = seq(as.Date("1980-01-01"), as.Date("2013-12-31"), by="days")
HN3 = data.frame(Date=dates)
Now, load in your first file, the one that has a date and Value1.
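file1 = read.csv(value1.file) # value1.file is a placeholder path; I'm assuming this file has a column already named "Date" and one named "Value1"
file1$Date = as.Date(file1$Date) # assumes yyyy-mm-dd dates; the join key should have the same class as HN3$Date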
HN3 = left_join(HN3,file1,by="Date")
This will do a left join (SQL style) matching only the rows where a date exists and filling in the rest with NA. Now you have a data frame with two columns, Date and Value1. Load in your other files, do a left_join with each and you'll be done.
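If the matrix layout from the question is preferred over a data frame, the loop can be replaced by a single vectorised assignment. The sketch below is base R with invented sample values. It assumes the error in the question comes from dates in the file that are not row names of HN3, which is a guess, and it quotes the column name "Value1", which the loop in the question does not.

```r
# A shortened date range keeps the example small; the question uses 1980 to 2013.
dates <- seq(as.Date("1980-01-01"), as.Date("1980-01-10"), by = "days")
HN3 <- matrix(NA_real_, nrow = length(dates), ncol = 2,
              dimnames = list(as.character(dates), c("Value1", "Value2")))

# Stand-in for a file read into a data frame: only some dates are present,
# and one lies outside the range covered by HN3.
v1read <- data.frame(Date = as.Date(c("1980-01-02", "1980-01-05", "1979-12-31")),
                     Value = c(1.5, 2.5, 9.9))

# Row position of each file date in HN3; NA where the date is not covered.
idx <- match(as.character(v1read$Date), rownames(HN3))

# Assign all matched values at once, skipping dates outside the matrix.
keep <- !is.na(idx)
HN3[idx[keep], "Value1"] <- v1read$Value[keep]
```

Days without a measurement simply stay NA, which matches what the left join produces in the data frame version.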
