How to apply a function to each row in SparkR?

How to apply a function to each row in SparkR? - r

I have a file in CSV format which contains a table with column "id", "timestamp", "action", "value" and "location".
I want to apply a function to each row of the table and I've already written the code in R as follows:
user <- read.csv(file_path,sep = ";")
num <- nrow(user)
curLocation <- "1"
for(i in 1:num) {
row <- user[i,]
if(user$action != "power")
curLocation <- row$value
user[i,"location"] <- curLocation
}
The R script works fine and now I want to apply it SparkR. However, I couldn't access the ith row directly in SparkR and I couldn't find any function to manipulate every row in SparkR documentation.
Which method should I use in order to achieve the same effect as in the R script?
In addition, as advised by #chateaur, I tried to code using dapply function as follows:
curLocation <- "1"
schema <- structType(structField("Sequence","integer"), structField("ID","integer"), structField("Timestamp","timestamp"), structField("Action","string"), structField("Value","string"), structField("Location","string"))
setLocation <- function(row, curLoc) {
if(row$Action != "power|battery|level"){
curLoc <- row$Value
}
row$Location <- curLoc
}
bw <- dapply(user, function(row) { setLocation(row, curLocation)}, schema)
head(bw)
Then I got an error:
I looked up the warning message the condition has length > 1 and only the first element will be used and I found something https://stackoverflow.com/a/29969702/4942713. It made me wonder whether the row parameter in the dapply function represent an entire partition of my data frame instead of one single row? Maybe dapply function is not a desirable solution?
Later, I tried to modify the function as advised by #chateaur. Instead of using dapply, I used dapplyCollect which saves me the effort of specifying the schema. It works!
changeLocation <- function(partitionnedDf) {
nrows <- nrow(partitionnedDf)
curLocation <- "1"
for(i in 1:nrows){
row <- partitionnedDf[i,]
if(row$action != "power") {
curLocation <- row$value
}
partitionnedDf[i,"location"] <- curLocation
}
partitionnedDf
}
bw <- dapplyCollect(user, changeLocation)

Scorpion775,
You should share your sparkR code. Don't forget that data isn't manipulated the same way in R and sparkR.
From : http://spark.apache.org/docs/latest/sparkr.html,
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
Then you can look at dapply function here : https://spark.apache.org/docs/2.1.0/api/R/dapply.html
Here is a working example :
changeLocation <- function(partitionnedDf) {
nrows <- nrow(partitionnedDf)
curLocation <- as.integer(1)
# Loop over each row of the partitionned data frame
for(i in 1:nrows){
row <- partitionnedDf[i,]
if(row[1] != "power") {
curLocation <- row[2]
}
partitionnedDf[i,3] <- curLocation
}
# Return modified data frame
partitionnedDf
}
# Load data
df <- read.df("data.csv", "csv", header="false", inferSchema = "true")
head(collect(df))
# Define schema of dataframe
schema <- structType(structField("action", "string"), structField("value", "integer"),
structField("location", "integer"))
# Change location of each row
df2 <- dapply(df, changeLocation, schema)
head(df2)

Related

Breaking the golden rules on FOR loops with R

Firstly, apologies as this may seem a bit long winded, but I hope to give as much information on this problem as I can...
I have written a script that loops through a set of files defined in a csv file. Each file within this csv listing is an XML file, each one is for a particular event in an application, and all files within this list are of the same event type. However, each file can contain different data. For instance, one could hold an attribute with no child nodes beneath, while others contain nodes.
My script works perfectly fine, but when it gets to about XML file 5000, it has slowed down considerably.
Problem is that my code creates a blank dataframe initially, and then grows is at new columns are detected.
I understand that this is a big NO NO when it comes to writing R FOR loops, but am unsure how to get around this problem, give my smallest file listing is 69000, which makes going through each one in turn and counting the nodes a task in itself.
Are there any ideas on how to get around this?
pseudo code or actual R code to do this would be great. So would ideas/opinions, as I am unsure on the best approach to this task.
Here is my current code.
library(XML)
library(xml2)
library(plyr)
library(tidyverse)
library(reshape2)
library(foreign)
library(rio)
# Get file data to be used
#
setwd('c:/temp/xml')
headerNames <- c('GUID','EventId','AppId','RequestFile', 'AE_Type', 'AE_Drive')
GetNames <- rowid_to_column(read.csv(file= 'c:/temp/xml/R_EventIdA.csv', fileEncoding="UTF-8-BOM", header = FALSE, col.names = headerNames),'ID')
inputfiles <- as.character(GetNames[,5]) # Gets list of files
# Create empty dataframes
#
df <- data.frame()
transposed.df1 <- data.frame()
allxmldata <- data.frame()
findchildren<-function(nodes, df) {
numchild <- sapply(nodes, function(x){length(xml_children(x))})
xml.value <- xml_text(nodes[numchild==0])
xml.name <- xml_name(nodes[numchild==0])
xml.path <- sapply(nodes[numchild==0], function(x) {gsub(', ','_', toString(rev(xml_name(xml_parents(x)))))})
fieldname <- paste(xml.path,xml.name,sep = '_')
contents <- sapply(xml.value, function(f){is.na(f)<-which(f == '');f})
if (length(fieldname) > 0) {
fieldname <- paste(fieldname,xml.value, sep = '_')
dftemp <- data.frame(fieldname, contents)
df <- rbind(df, dftemp)
print(dim(df))
}
if (sum(numchild)>0){
findchildren(xml_children(nodes[numchild>0]), df) }
else{ return(df)
}
}
findchildren2<-function(nodes, df){
numchild<-sapply(nodes, function(embeddedinputfile){length(xml_children(embeddedinputfile))})
xmlvalue<-xml_text(nodes[numchild==0])
xmlname<-xml_name(nodes[numchild==0])
xmlpath<-sapply(nodes[numchild==0], function(embeddedinputfile) {gsub(', ','_', toString(rev(xml_name(xml_parents(embeddedinputfile)))))})
fieldname<-paste(xmlpath,xmlname,sep = '_')
contents<-sapply(xmlvalue, function(f){is.na(f)<-which(f == '');f})
if (length(fieldname) > 0) {
dftemp<-data.frame(fieldname, contents)
df<-rbind(df, dftemp)
print(dim(df))
}
if (sum(numchild)>0){
findchildren2(xml_children(nodes[numchild>0]), df) }
else{ return(df)
}
}
# Loop all files
#
for (x in inputfiles) {
df1 <- findchildren(xml_children(read_xml(x)),df)
## original xml dataframe
if (length(df1) > 0) {
xml.df1 <- data.frame(spread(df1, key = fieldname, value = contents), fix.empty.names = TRUE)
}
##
xml.df1 %>%
pluck('Response_RawData') -> rawxml
if (length(rawxml)>0) {
df.rawxml <- data.frame(rawxml)
export(df.rawxml,'embedded.xml')
embeddedinputfile <-as.character('embedded.xml')
rm(df1)
df1 <- findchildren2(xml_children(read_xml(embeddedinputfile)),df)
if (length(df1) > 0) {
xml.df2 <- spread(df1, key = fieldname, value = contents)
}
allxmldata <- rbind.fill(allxmldata,cbind(xml.df1,xml.df2))
} else {
allxmldata <- rbind.fill(allxmldata,cbind(xml.df1))
}
}
if(nrow(allxmldata)==nrow(GetNames)) {
alleventdata<-cbind(GetNames,allxmldata)
}
dbConn2 <- odbcDriverConnect('driver={SQL Server};server=PC-XYZ;database=Events;trusted_connection=true')
sqlSave(dbConn2, alleventdata, tablename = 'AE_EventA', append = TRUE )

R Data Frames column names rename

I am new to R and not sure why I have to rename data frame column names at the end of the program though I have defined data frame with column names at the beginning of the program. The use of the data frame is, I got two columns where I have to save sequence under ID column and some sort of number in NOBS column.
complete <- function(directory, id = 1:332) {
collectCounts = data.frame(id=numeric(), nobs=numeric())
for(i in id) {
fileName = sprintf("%03d",i)
fileLocation = paste(directory, "/", fileName,".csv", sep="")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[,2]), na.rm=TRUE)
collectCounts <- rbind(collectCounts, c(id=i, completeCount))
#print(completeCount)
}
colnames(collectCounts)[1] <- "id"
colnames(collectCounts)[2] <- "nobs"
print(collectCounts)
}

Its not quite clear what your specific problem is, as you did not provide a complete and verifiable example. But I can give a few pointers on improving the code, nonetheless.
1) It is not recommended to 'grow' a data.frame within a loop. This is extremely inefficient in R, as it copies the entire structure each time. Better is to assign the whole data.frame at the outset, then fill in the rows in the loop.
2) R has a handy functionpaste0 that does not require you to specify sep = "".
3) There's no need to specify na.rm = TRUE in your sum, because is.na will never return NA's
Putting this together:
complete = function(directory, id = 1:332) {
collectCounts = data.frame(id=id, nobs=numeric(length(id)))
for(i in 1:length(id)) {
fileName = sprintf("%03d", id[i])
fileLocation = paste0(directory, "/", fileName,".csv")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[, 2]))
collectCounts[i, 'nobs'] <- completeCount
}
}

Always hard to answer questions without example data.
You could start with
collectCounts = data.frame(id, nobs=NA)
And in your loop, do:
collectCounts[i, 2] <- completeCount
Here is another way to do this:
complete <- function(directory, id = 1:332) {
nobs <- sapply(id, function(i) {
fileName = paste0(sprintf("%03d",i), ".csv")
fileLocation = file.path(directory, fileName)
fileData = read.csv(fileLocation, header=TRUE)
sum(!is.na(fileData[,2]), na.rm=TRUE)
}
)
data.frame(id=id, nobs=nobs)
}

How do create one data.frame with multiple csv files in R using a function? [duplicate]

This question already has answers here:
What's wrong with my function to load multiple .csv files into single dataframe in R using rbind?
(6 answers)
Closed 5 years ago.
I am quite new to R and I need some help. I have multiple csv files labeled from 001 to 332. I would like to combine all of them into one data.frame. This is what I have done so far:
filesCSV <- function(id = 1:332){
fileNames <- paste(id) ## I want fileNames to be a vector with the names of all the csv files that I want to join together
for(i in id){
if(i < 10){
fileNames[i] <- paste("00",fileNames[i],".csv",sep = "")
}
if(i < 100 && i > 9){
fileNames[i] <- paste("0", fileNames[i],".csv", sep = "")
}
else if (i > 99){
fileNames[i] <- paste(fileNames[i], ".csv", sep = "")
}
}
theData <- read.csv(fileNames[1], header = TRUE) ## here I was trying to create the data.frame with the first csv file
for(a in 2:332){
theData <- rbind(theData,read.csv(fileNames[a])) ## here I wanted to use a for loop to cycle through the names of files in the fileNames vector, and open them all and add them to the 'theData' data.frame
}
theData
}
Any help would be appreciated, Thanks!

Hmm it looks roughly like your function should already be working. What is the issue?
Anyways here would be a more idiomatic R way to achieve what you want to do that reduces the whole function to three lines of code:
Construct the filenames:
infiles <- sprintf("%03d.csv", 1:300)
the %03d means: insert an integer value d padded to length 3 zeroes (0). Refer to the help of ?sprintf() for details.
Read the files:
res <- lapply(infiles, read.csv, header = TRUE)
lapply maps the function read.csv with the argument header = TRUE to each element of the vector "infiles" and returns a list (in this case a list of data.frames)
Bind the data frames together:
do.call(rbind, res)
This is the same as entering rbind(df1, df2, df3, df4, ..., dfn) where df1...dfn are the elments of the list res

You were very close; just needed ideas to append 0s to files and cater for cases when the final data should just read the csv or be an rbind
filesCSV <- function(id = 1:332){
library(stringr)
# Append 0 ids in front
fileNames <- paste(str_pad(id, 3, pad = "0"),".csv", sep = "")
# Initially the data is NULL
the_data <- NULL
for(i in seq_along(id)
{
# Read the data in dat object
dat <- read.csv(fileNames[i], header = TRUE)
if(is.null(the_data) # For the first pass when dat is NULL
{
the_data <- dat
}else{ # For all other passes
theData <- rbind(theData, dat)
}
}
return(the_data)
}

Remove path from variable name in a dataframe

I've put together a function that looks like this, with the first comment lines being an example. Most importantly here is the set.path variable that I use to set the path initially for the function.
# igor.import(set.path = "~/Desktop/Experiment1 Folder/SCNavigator/Traces",
# set.pattern = "StepsCrop.ibw",
# remove.na = TRUE)
igor.multifile.import <- function(set.path, set.pattern, remove.na){
{
require("IgorR")
require("reshape2")
raw_list <- list.files(path= set.path,
pattern= set.pattern,
recursive= TRUE,
full.names=TRUE)
multi.read <- function(f) { # Note that "temp.data" is just a placeholder in the function
temp_data <- as.vector(read.ibw(f)) # Change extension to match your data type
}
my_list <- sapply(X = raw_list, FUN = multi.read) # Takes all files gathered in raw_list and applies multi.read()
my_list_combined <- as.data.frame(do.call(rbind, my_list))
my_list_rotated <- t(my_list_combined[nrow(my_list_combined):1,]) # Matrix form
data_out <- melt(my_list_rotated) # "Long form", readable by ggplot2
data_out$frame <- gsub("V", "", data_out$Var1)
data_out$name <- gsub(set.path, "", data_out$Var2) # FIX THIS
}
if (remove.na == TRUE){
set_name <- na.omit(data_out)
} else if (remove.na == FALSE) {
set_name <- data_out
} else (set_name <- data_out)
}
When I run this function I'll get a large dataframe, where each file that matched the pattern will show up with a name like
/Users/Joh/Desktop/Experiment1 Folder/SCNavigator/Traces/Par994/StepsCrop.ibw`
that includes the entire filepath, and is a bit unwieldy to look at and deal with.
I've tried to remove the path part with the line that says
data_out$name <- gsub(set.path, "", data_out$Var2)
Similar to the command above that removes the dataframe auto-named V1, V2, V3... (which works). I can't remove the string part matching the set.path = "my/path/" though.

Regardless of what your set.path is, you can eliminate it by
gsub(".*/","",mypath)
mypath<-"/Users/Joh/Desktop/Experiment1 Folder/SCNavigator/Traces/Par994/StepsCrop.ibw"
gsub(".*/","",mypath)
[1] "StepsCrop.ibw"
`

Overwriting result with for loop in R

I have a number of csv files and my goal is to find the number of complete cases for a file or set of files given by id argument. My function should return a data frame with column id specifying the file and column obs giving the number of complete cases for this id. However, my function overwrites the previous value of nobs in each loop and the resulting data frame gives me only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction<-function(id=1:20) {
files<-list.files(pattern="*.csv")
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x,stringsAsFactors = FALSE)))
for (i in id) {
good<-complete.cases(myfiles)
newframe<-myfiles[good,]
cases<-newframe[newframe$ID %in% i,]
nobs<-nrow(cases)
}
clean<-data.frame(id,nobs)
clean
}
Thanks.

We can do all inside lapply(), something like below (not tested):
myfunction <- function(id = 1:20) {
files <- list.files(pattern = "*.csv")[id]
do.call(rbind,
lapply(files, function(x){
df <- read.csv(x,stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]
data.frame(ID=x,nobs=nrow(df))
}
)
)
}

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to apply a function to each row in SparkR? - r

Related

Breaking the golden rules on FOR loops with R

R Data Frames column names rename

How do create one data.frame with multiple csv files in R using a function? [duplicate]

Remove path from variable name in a dataframe

Overwriting result with for loop in R

Categories

Resources