I want to read a source file and write the data to a .csv file in Spark Scala with additional identity columns - scala

I want to read a CSV file and write it to another CSV file with a few additional columns, such as an auto-generated identity column and a load date-time column.
I am using Spark 2.0.

You can use the withColumn method to add columns to a DataFrame.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
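As a rough sketch of how withColumn could cover both requested columns: monotonically_increasing_id() yields a unique (but not consecutive) 64-bit ID per row, and current_timestamp() yields the load time. The column names row_id and load_datetime, the object name, and the paths input.csv and output_dir are assumptions made up for illustration, not anything from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{current_timestamp, monotonically_increasing_id}

object AddAuditColumns {
  // Adds a unique (not consecutive) ID and a load timestamp.
  // "row_id" and "load_datetime" are illustrative column names.
  def withAuditColumns(df: DataFrame): DataFrame =
    df.withColumn("row_id", monotonically_increasing_id())
      .withColumn("load_datetime", current_timestamp())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AddAuditColumns")
      .master("local[*]") // assumption: local run; drop this when using spark-submit
      .getOrCreate()

    // Hypothetical input path; the first line is treated as the header.
    val input = spark.read.option("header", "true").csv("input.csv")

    // Spark writes a directory of part files, not a single .csv file.
    withAuditColumns(input)
      .write
      .option("header", "true")
      .mode("overwrite")
      .csv("output_dir") // hypothetical output directory

    spark.stop()
  }
}
```

If the identity values have to be consecutive (1, 2, 3, ...), monotonically_increasing_id() is not enough on its own; one common alternative is row_number() over a window with an explicit ordering, at the cost of moving all rows through a single partition.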

Related

Is there a library that can generate CSV files given a data dictionary and a data model in some format?

Is there a library in any language that can generate a .csv file for each entity of a data model, with values that comply with a data dictionary?
For example:
The data dictionary is specified in a CSV file with these column names - field,regex,description
The data model is specified in another CSV file with these column names - entity,field
faker comes very close; however, it needs some programming to work with a data model. If there is a wrapper around faker, that might work well.

Trying to create new columns using header information, add a column containing the file name and merge multiple csv files in R

I have only recently started using R and am now trying to automate some tasks with it. I've a task where I want to merge information from ~300 .csv files. Each file is in the same format with information in a header section followed by data in standard columns.
I want to:
1. Create a new column that contains the file name
2. Create columns that use header information (e.g. lot number) on each row in the file
3. Merge all csv files in a folder together.
I've seen bits of code that can merge csv files together using list.files(), lapply() and bind_rows(), but I am struggling to get the header information into new columns before merging the csv files together.
sample of csv file
Has anyone a solution to this?

How to use for loop to read and append multiple csv files in R?

I am a student and have just started learning R. I have 19 Excel CSV files. I want to read the files one by one and append the new rows to a data frame. It is recommended to use functions from the tidyverse package to read and import these files. The first 7 rows of each file are metadata, which need to be skipped. How can I do these steps inside a for loop?

Crawl Excel data automatically based on contents

I want to get data from many excel files with the same format like this:
What I want is to output the ID data from column B to a CSV file. I have many files like this and for each file, the number of columns may not be the same but the ID data will always be in the B column.
Is there a package in Julia that can crawl data in this format? If not, what method should I use?
You can use the XLSX package.
If the file in your screenshot is called JAKE.xlsx and the data shown is in a sheet called DataSheet:
using XLSX
data = XLSX.readtable("JAKE.xlsx", "DataSheet")
# `data[1]` is a vector of vectors, each holding the data for one column.
# That way, `data[1][2]` corresponds to column B's data.
data[1][2]
This should give you access to a vector with the data you need. After getting the IDs into a vector, you can use the CSV package to create an output file.
If you add a sample xlsx file to your post it might be possible to give you a more complete answer.

read a selected column from multiple csv files and combine into one large file in R

Hi,
My task is to read selected columns from over 100 identically formatted .csv files in a folder, and to cbind them into one large file using R. I have attached a screenshot of a sample data file to this question.
This is the code I'm using:
filenames <- list.files(path="G:\\2014-02-04")
mydata <- do.call("cbind",lapply(filenames,read.csv,skip=12))
My problem is that the first column is the same in every .csv file, so my code creates a big file with duplicated first columns. How can I create a big file with just a single column A (no duplicates)? I would also like to name the second column read from each .csv file using the value of cell B7, which is the specific timestamp of each .csv file.
Can someone help me on this?
Thanks.
