I have a directory containing many CSV files which I have loaded into a dictionary of dataframes.
Here are three small sample CSV files to illustrate:
import os
import csv
import pandas as pd

# create 3 small csv files for test purposes
os.chdir('c:/test')
with open('dat1990.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = [['Stock', 'Sales', 'Year'],
            ['100', '24', '1990'],
            ['120', '33', '1990'],
            ['23', '5', '1990']]
    a.writerows(data)
with open('dat1991.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = [['Stock', 'Sales', 'Year'],
            ['400', '35', '1991'],
            ['450', '55', '1991'],
            ['34', '6', '1991']]
    a.writerows(data)
with open('other1991.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = [['Stock', 'Sales', 'Year'],
            ['500', '56', '1991'],
            ['600', '44', '1991'],
            ['56', '55', '1991']]
    a.writerows(data)
Create a dictionary for processing the csv files into dataframes:
dfcsv_dict = {'dat1990': 'dat1990.csv', 'dat1991': 'dat1991.csv',
              'other1991': 'other1991.csv'}
Create a simple import function for importing a csv into pandas:
def myimport(csvfile):
    return pd.read_csv(csvfile)
Iterate through the dictionary to import all csv files into pandas dataframes:
df_dict = {}
for k, v in dfcsv_dict.items():
    df_dict[k] = myimport(v)
Given I now may have thousands of dataframes within the unified dictionary object, how can I select a few and "extract" them out of the dictionary?
So for example, how would I extract just two of these three dataframes nested in the dictionary, something like
dat1990 = df_dict['dat1990']
dat1991 = df_dict['dat1991']
but without using literal assignments. Maybe some sort of looping structure over the dictionary, hopefully with a means to select a subgroup based on a string sequence in the dictionary key,
e.g. all dataframes whose key contains "dat" or "1991", etc.
I don't want another "sub dictionary" but want to extract them as named "standalone" dataframes as the above code illustrates.
I am using python 3.5.
This is an old question from Jan 2016, but since no one answered, here is an answer from Oct 2019. It might be useful for future reference.
I think you can skip the step of creating a dictionary of dataframes. I previously wrote an answer, "Create a dataframe of csv files based on timestamp intervals", on how to create a single master dataframe from multiple CSV files and add a column to the master dataframe with a string extracted from each CSV filename. I think you can do essentially the same thing here.
Steps:
Create path to folder with files
Create list of files in folder
Create empty dataframe to store CSV dataframes
Loop through each csv as a dataframe
Add a column with the filename as a string
Concatenate the individual dataframe to the master dataframe
Use a dataframe filter mask to create a new dataframe
import pandas as pd
import os

# Step 1: create a path to the folder, syntax for Windows OS
path_test_folder = 'C:\\test\\'

# Step 2: create a list of CSV files in the folder
files_in_folder = os.listdir(path_test_folder)
files_in_folder = [x for x in files_in_folder if '.csv' in x]

# Step 3: create an empty master dataframe to store the CSV data
df_master = pd.DataFrame()

# Step 4: loop through the files in the folder
for each_csv in files_in_folder:
    # temporary dataframe for the CSV
    path_csv = os.path.join(path_test_folder, each_csv)
    temp_df = pd.read_csv(path_csv)
    # add a column with the filename as a string
    temp_df['str_filename'] = str(each_csv)
    # combine into the master dataframe
    df_master = pd.concat([df_master, temp_df])

# then filter on your filenames
mask_filter = df_master['str_filename'].isin(['dat1990.csv', 'dat1991.csv'])
df_filter = df_master.loc[mask_filter]
I have data in different folders that I need to import and transform in a loop using purrr. The paths and names of the csv files follow the pattern below:
data/csd-alberta/
data/csd-ontario/
data/csd-pei/
data/csd-bc/
# for all of the provinces
c(alberta, bc, newbruns, newfoundland, nova, nunavut, nw, ont, pei, qc, sask, yukon)
There are many csv files in each province folder, but the main dataset I want to import starts with 98. For example:
# note that all data sets must begin with 98 and end with .csv.
csd_alberta_raw <- read_csv("csd-alberta/98-1.csv")
csd_bc_raw <- read_csv("csd-bc/98-2.csv")
csd_ont_raw <- read_csv("csd-ont/98-3.csv")
There are other csv files in each folder, so I only need to import the ones that start with 98.
I would like to use purrr and map_df to integrate the data transformation for all the files, since they all have the same columns and require the same data cleaning. But I'm not sure how to do this for the whole directory, and also how to specify the pattern for the csv filenames.
You can use the following:
Use list.files to get the complete paths of the files in all the folders matching a specific pattern ('^98.*\\.csv$').
Use map_df to read all the files and combine them. I have also included a new column called file which identifies the file the data comes from.
filenames <- list.files('data/', recursive = TRUE, full.names = TRUE, pattern = '^98.*\\.csv$')
combine_data <- purrr::map_df(filenames, readr::read_csv, .id = 'file')
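If you want the file column to hold the actual file path rather than a position index (with an unnamed vector, .id stores the element's index), a small variation is to name the vector first, for example:
# naming the vector makes .id carry the file path instead of a numeric index
names(filenames) <- basename(filenames)
combine_data <- purrr::map_df(filenames, readr::read_csv, .id = 'file')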
I already know how to load a single CSV into a DataFrame:
using CSV
using DataFrames
df = DataFrame(CSV.File("C:\\Users\\username\\Table_01.csv"))
How would I do this when I have several CSV files, e.g. Table_01.csv, Table_02.csv, Table_03.csv?
Would I create a bunch of empty DataFrames and use a for loop to fill them? Or is there an easier way in Julia? Many thanks in advance!
If you want multiple data frames (not a single data frame holding the data from multiple files) there are several options.
Let me start with the simplest approach using broadcasting:
dfs = DataFrame.(CSV.File.(["Table_01.csv", "Table_02.csv", "Table_03.csv"]))
or
dfs = @. DataFrame(CSV.File(["Table_01.csv", "Table_02.csv", "Table_03.csv"]))
or (a bit more advanced, using function composition):
(DataFrame∘CSV.File).(["Table_01.csv", "Table_02.csv", "Table_03.csv"])
or using chaining:
CSV.File.(["Table_01.csv", "Table_02.csv", "Table_03.csv"]) .|> DataFrame
Other options include map, as suggested in the comment:
map(DataFrame∘CSV.File, ["Table_01.csv", "Table_02.csv", "Table_03.csv"])
or just use a comprehension:
[DataFrame(CSV.File(f)) for f in ["Table_01.csv", "Table_02.csv", "Table_03.csv"]]
(I am listing the options to show different syntactic possibilities in Julia)
This is how I have done it, but there might be an easier way.
using DataFrames, Glob
import CSV
function readcsvs(path)
    files = glob("*.csv", path) # Vector of filenames. Glob allows you to use the asterisk.
    numfiles = length(files) # Number of files to read.
    tempdfs = Vector{DataFrame}(undef, numfiles) # Create a vector of empty dataframes.
    for i in 1:numfiles
        tempdfs[i] = CSV.read(files[i], DataFrame) # Read each CSV into its own dataframe.
    end
    masterdf = outerjoin(tempdfs..., on="Column In Common") # Join the temporary dataframes into one dataframe.
end
A simple solution where you don't have to explicitly enter filenames:
using CSV, Glob, DataFrames
path = raw"C:\..." # directory of your files (raw is useful in Windows to add a \)
files=glob("*.csv", path) # to load all CSVs from a folder (* means arbitrary pattern)
dfs = DataFrame.( CSV.File.( files ) ) # creates a list of dataframes
# add an index column to be able to later discern the different sources
for i in 1:length(dfs)
    dfs[i][!, :sample] .= i # I called the new col sample
end
# finally, if you want, reduce your collection of dfs via vertical concatenation
df = reduce(vcat, dfs)
I need some help creating a dataset in R where each observation contains a latitude, longitude, and date. Right now, I have a list of roughly 2,000 files gridded by lat/long, and each file contains observations for one date. Ultimately, I need to combine all of these files into one file where each observation contains a date variable pulled from the name of its file.
So for instance, a file is named "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc". I want all observations from that file to contain a date variable for 02/17/2012.
That "nc" extension describes a netCDF file, which can be read into R as follows:
library(RNetCDF)
setwd("~/Desktop/Thesis Data")
p1a<-"MERRA2_300.tavg1_2d_flx_Nx.20050101.SUB.nc"
pid<-open.nc(p1a)
dat<-read.nc(pid)
I know the ldply command can be useful for extracting and designating a new variable from the file name. But I need to create a loop that combines all the files in the 'Thesis Data' folder above (set as my wd) and gives them date variables in the process.
I have been attempting this using two separate loops. The first loop uploads files one by one, creates a date variable from the file name, and then resaves them into a new folder. The second loop concatenates all files in that new folder. I have had little luck with this strategy.
View(dat)
As you can hopefully see in this picture, which describes the data file loaded above, each file contains a time variable, but that time variable has a single observation, 690, in each file. So I could replace that variable with the date from the file name, or I could create a new variable; either works.
Any help would be much appreciated!
I do not have any experience working with .nc files, but what I think you need to do, in broad strokes, is this:
filenames <- list.files(path = ".") # Creates a character vector of all file names in working directory
Create an empty dataframe with column names:
final_data <- data.frame(matrix(ncol = ..., nrow = 0)) # enter number of columns you will have in the final dataset
colnames(final_data) <- c("...", "...", "...", ...) # create column names
For each filename, read in the file, create a date column, and append the result to the final dataframe:
for (i in filenames) {
  pid <- open.nc(i)
  dat <- read.nc(pid)
  date <- ... # use regex to get your date from i and convert it into a date
  dat$date <- date
  final_data <- rbind(final_data, dat)
}
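For the regex step, a minimal sketch, assuming the filenames follow the MERRA2_*.YYYYMMDD.SUB.nc pattern shown in the question, could look like this:
# extract the 8-digit date from a name like "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc"
date_string <- sub(".*\\.([0-9]{8})\\.SUB\\.nc$", "\\1", i)
date <- as.Date(date_string, format = "%Y%m%d")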
I used the package 'GDELTtools' to download data from GDELT. The data was downloaded; however, no variable was stored in the global environment. I want to store the data in a dataframe variable so I can analyze it.
The folder contains over 30 zipped files. Each zipped file contains one csv. I need to store all these csvs in one variable in the global environment of R. I hope this can be done.
Thank you in advance!
Haven't written R for a while so I will try my best.
Read the comments carefully, because they explain the procedure.
I will attach links with more information on: unzip, read.csv, merging data frames, creating an empty data frame, and concatenating strings.
According to the GDELTtools docs, you can easily specify the download folder by providing local.folder="~/gdeltdata" as a parameter to the GetGDELT() function.
After that, you can use the list.files("path/to/files/directory") function to obtain a vector of file names, which is used in the explanation code below. Check the docs for more examples and explanation.
# set the path for the unzip output
outDir <- "C:\\Users\\Name\\Documents\\unzipfolder"

# path where the zip files are stored
relativePath <- "C:\\path\\to\\my\\directory\\"

# create a variable to store all the paths to the zip files in a vector
zipPaths <- vector()

# since we have 30 files we should iterate through them
# I assume you have a vector with the file names in the variable fileNamesZip
for (name in fileNamesZip) {
  # use paste0() to concatenate the path, the name, and the extension
  zipfilepath <- paste0(relativePath, name, ".zip")
  # append the file path (append() does not modify in place, so reassign)
  zipPaths <- append(zipPaths, zipfilepath)
}

# now we have a vector which contains all the paths to the zip files
# base R unzip() extracts one archive at a time, so loop over the paths (read the official docs)
lapply(zipPaths, unzip, exdir = outDir)

# initialize a dataframe for all the data. You must provide datatypes for the columns.
total <- data.frame(Doubles=double(),
                    Ints=integer(),
                    Factors=factor(),
                    Logicals=logical(),
                    Characters=character(),
                    stringsAsFactors=FALSE)

# now it's time to read the csv files and store them in the dataframe.
# again, I assume you have a vector with the file names in the variable fileNamesCSV
for (name in fileNamesCSV) {
  # create the csv file path
  csvfilepath <- paste0(outDir, name, ".csv")
  # read the data from the csv file and store it in a dataframe
  dataFrame <- read.csv(file=csvfilepath, header=TRUE, sep=",")
  # you will only be able to merge dataframes if they are equal in structure. Specify the column names to merge by.
  total <- merge(total, dataFrame, by=c("Name1","Name2"))
}
Something potentially much simpler:
list.files() lists the files in a directory
readr::read_csv() will automatically unzip files as necessary
dplyr::bind_rows() will combine data frames
So try:
lf <- list.files(pattern="\\.zip")
dfs <- lapply(lf,readr::read_csv)
result <- dplyr::bind_rows(dfs)
I have ~45 files of 5-6 MB each, containing over 3000 JSON objects, that I want to work with in R. I've been able to import each JSON file independently with fromJSON() as a list, except one for which I had to use stream_in(), but I am having trouble coercing it into a useful structure. I want to create a data frame merging all the files with rbind. The goal is then to merge the result with the other file using cbind.
library(jsonlite)

allfiles <- list.files()
for (file in allfiles) {
  jsonFusion <- fromJSON(file)
  file1 <- do.call(rbind, jsonFusion)
}
stream_in(file("files2"))
The first step (the loop) is a little slow, and I don't know how to merge file1 and file2, nor how to end up with a dataframe.
The function as.data.frame() is not working.
Assuming the data structures are consistent.
library(jsonlite)
all_files <- list.files(path = "path/to/files", full.names = TRUE)
rbind_pages(lapply(all_files, fromJSON))