I've looked around but didn't find any suggestions.
I want to know how to save an empty file to ADLS Gen2 using R and then read it back in the same code.
Thanks for the help.
Since you want to write an empty file to ADLS Gen2 using R and read it back from that location, first create a data frame.
The data frame can either be completely empty, or it can have column names but no rows. You can use the following code to create one.
df <- data.frame()                        # completely empty: no columns, no rows
df <- data.frame(col_name = character())  # OR: one typed, zero-row column ("col_name" and character() are placeholders)
Once you have created a data frame, you must establish a connection to the ADLS Gen2 account. You can use the method described in the following link to do that.
https://saketbi.wordpress.com/2019/05/11/how-to-connect-to-adls-gen2-using-sparkr-from-databricks-rstudio-while-integrating-securely-with-azure-key-vault/
After making the connection, you can use R's read and write functions with the path to the ADLS Gen2 storage. The following link describes a number of functions you can use depending on your requirement.
https://www.geeksforgeeks.org/reading-files-in-r-programming/
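For example, once the SparkR session from the linked post is set up, the basic write-then-read pattern looks like this (a sketch; the container, account, and folder names are placeholders):
library(SparkR)
path <- "abfss://<container>@<account>.dfs.core.windows.net/<folder>/"
# Write the data frame out as parquet. A completely empty data frame may need an
# explicit schema passed to createDataFrame, as in the solution further down.
write.df(createDataFrame(df), path = path, source = "parquet", mode = "overwrite")
# Read it back from the same location.
df_back <- read.df(path, source = "parquet")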
Sharing the solution to my own question: I used SparkR to create an empty data frame and save it to ADLS after setting up the Spark context.
The solution is below.
First, create a schema and a matching empty data frame, build an empty SparkDataFrame from them, and save it as a Delta table:
fcstSchema <- structType(structField("ABC", "string", TRUE))
new_df <- data.frame(ABC = NULL, stringsAsFactors = FALSE)  # empty R data frame
n <- createDataFrame(new_df, fcstSchema)                    # empty SparkDataFrame with the schema applied
SparkR::saveDF(n, path = "abfss://<account_name>@<datalake>.dfs.core.windows.net/<path>/", source = "delta", mode = "overwrite")
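To read it back in the same notebook, SparkR's read.df works against the same path (a small sketch; the placeholders match the saveDF call above):
n_back <- SparkR::read.df("abfss://<account_name>@<datalake>.dfs.core.windows.net/<path>/", source = "delta")
SparkR::printSchema(n_back)  # ABC: string
SparkR::count(n_back)        # 0 rows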
I want to export a dataset containing personal IDs and other variables to REDCap. Does anyone know how to do it?
I have found a package called REDCapR, but it seems to only cover importing data from REDCap into R.
REDCapR::redcap_write() moves info from a data.frame on your local R instance to the remote REDCap Server.
# Read the dataset for the first time.
result_read1 <- REDCapR::redcap_read_oneshot(redcap_uri=uri, token=token)
ds1 <- result_read1$data
ds1$telephone
# Manipulate a field in the dataset in a VALID way
ds1$telephone <- paste0("(405) 321-000", seq_len(nrow(ds1)))
ds1 <- ds1[1:3, ]
ds1$age <- NULL; ds1$bmi <- NULL # Drop the calculated fields before writing.
# Upload the data to the server.
result_write <- REDCapR::redcap_write(ds1, redcap_uri=uri, token=token)
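To confirm the upload worked, you can inspect the list returned by redcap_write (the full return value is documented on the reference page linked below), for example:
# Check the server's response after the upload.
result_write$success          # TRUE if REDCap accepted the records
result_write$outcome_message  # human-readable summary from the server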
Documentation and references:
The function's page in the reference manual: https://ouhscbbmc.github.io/REDCapR/reference/redcap_write.html
Troubleshooting document: https://ouhscbbmc.github.io/REDCapR/articles/TroubleshootingApiCalls.html#writing
Other vignettes that describe global approaches to working with R & REDCap: https://ouhscbbmc.github.io/REDCapR/articles/index.html
(I'm the primary REDCapR developer. The REDCapR::redcap_write() function was one of the first functions added, back in 2013.)
I have my final output in an R data frame. I need to write this output to a database in Azure Databricks. Can someone help me with the syntax? I used this code:
require(SparkR)
data1 <- createDataFrame(output)
write.df(data1, path="dbfs:/datainput/sample_dataset.parquet",
source="parquet", mode="overwrite")
This code runs without error, but I don't see the database in the datainput folder (mentioned in the path). Is there some other way to do it?
I believe you are looking for the saveAsTable function. write.df only saves the data to the file system; it does not register it as a table.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the code above, default is an existing database name, under which a new table named sample_table will be created. If you pass just sample_table instead of default.sample_table, it will be saved in the default database.
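To sanity-check that the table was registered, you can read it back by name (a small sketch; the table name matches the call above):
require(SparkR)
# Load the registered table back as a SparkDataFrame and preview it.
tbl <- tableToDF("default.sample_table")
head(tbl)
# Or query it through Spark SQL.
head(sql("SELECT * FROM default.sample_table"))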
I am trying to remove bias from a microscopy analysis, so I want to make it so the experimenter doesn't know what the conditions are for the image they are looking at.
To do this I need to rename every file in a directory so they can't be identified, but I also need to be able to know what the original filename was subsequently.
I made a folder with three files in it to try this out. I got the file list, made a vector for the new names, and combined them into a data frame.
setwd("~/Desktop/folder1")
filename_list<-list.files("~/Desktop/folder1")
new_filenames <- c("anon1", "anon2", "anon3")
require(reshape2)
df1 <- melt(data.frame(filename_list,new_filenames))
View(df1)
I've also been able to change names using scripts from a previous question
and R-bloggers, using sapply and file.rename. I got a little stuck using wildcards to select the whole filename (minus extension), but I'm sure it's possible:
sapply(filename_list,FUN=function(eachPath){file.rename(from=eachPath,to=sub(pattern="image_",replacement="anon",eachPath))})
How can I take the new_filenames vector and apply it with file.rename so that it corresponds to the original filenames (filename_list) in the df1 data frame,
or is there a better way to do this? Thanks.
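One possible approach (a sketch, assuming the working directory is still ~/Desktop/folder1 and that the anonymised names should keep each file's original extension) is to rename straight from the two vectors and write the mapping out as a blinding key:
# Build the new names, keeping the original extensions so the images still open.
old_names <- as.character(df1$filename_list)
new_names <- paste0(as.character(df1$new_filenames), ".", tools::file_ext(old_names))
# file.rename() is vectorised over from/to, so no sapply or wildcards are needed.
file.rename(from = old_names, to = new_names)
# Save the key somewhere outside the blinded folder so the images can be de-blinded later.
write.csv(data.frame(original = old_names, anonymised = new_names),
          "~/Desktop/blinding_key.csv", row.names = FALSE)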
Using Azure ML Workbench for a class project (the required tool), I coded the desired logic below in an exploration notebook, but I cannot find a way to include it in a Data Prep Transform Data flow.
all_columns = df.columns
sum_columns = [col_name for col_name in all_columns if col_name not in ['NPI', 'Gender', 'State', 'Credentials', 'Specialty']]
sum_op_columns = list(set(sum_columns) & set(df_op['Drug Name'].values))
The logic uses the column names from one data source, df_op (opioid drugs), to choose which subset of columns to include from another data source, df (all drugs). When adding a Python script/expression Transform Data flow, I only see the ability to reference the single df. Alternatives?
I may have a way for you to access both data frames.
In Workbench, once you have the data sources that you need loaded, right click on one and select "Generate Data Access Code File".
Once there, you're automatically given code to access that specific file; the same pattern works for your other data sources.
For example, my project has two data sources, and I can use the code below to access them both as pandas data frames and manipulate them as I need.
# Load each registered data source as a pandas DataFrame
# (the .dsource names below come from my example project).
df_salary = datasource.load_datasource('SalaryData.dsource')
df_startup = datasource.load_datasource('50-Startups.dsource')
I believe from there you can save your updated data frame to a CSV and then use that in the train script.
Hope that helps or at least points you to another solution.
I am using the Alteryx R Tool to sign an Amazon HTTP request. To do so, I need the hmac function that is included in the digest package.
I'm using a text input tool that includes the key and a datestamp.
Key = "foo"
datestamp = "20120215"
Here's the issue. When I run the following script:
the.data <- read.Alteryx("1", mode="data.frame")
write.Alteryx(base64encode(hmac(the.data$key,the.data$datestamp,algo="sha256",raw = TRUE)),1)
I get an incorrect result when compared to when I run the following:
write.Alteryx(base64encode(hmac("foo","20120215",algo="sha256",raw = TRUE)),1)
The difference is that when I hardcode the values for the key and object, I get the correct result, but if I use the variables from the R data frame, I get incorrect output.
Does the data frame alter the data in some way? Has anyone come across this when working with the R Tool in Alteryx?
Thanks for your input.
The issue appears to be that when creating the data frame, your character variables are converted to factors. The way to fix this with the data.frame constructor function is:
the.data <- data.frame(Key="foo", datestamp="20120215", stringsAsFactors=FALSE)
I haven't used read.Alteryx but I assume it has a similar way of achieving this.
Alternatively, if your data frame has already been created, you can convert the factors back into character:
write.Alteryx(base64encode(hmac(
as.character(the.data$Key),
as.character(the.data$datestamp),
algo="sha256",raw = TRUE)),1)
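A quick way to check the fix outside Alteryx is to rebuild the data frame with character columns and compare against the hardcoded call (a sketch using only the digest package):
library(digest)
the.data <- data.frame(Key = "foo", datestamp = "20120215", stringsAsFactors = FALSE)
sig_df  <- hmac(the.data$Key, the.data$datestamp, algo = "sha256", raw = TRUE)
sig_lit <- hmac("foo", "20120215", algo = "sha256", raw = TRUE)
identical(sig_df, sig_lit)  # TRUE once the columns are plain character vectors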