How to rename columns in Sparklyr in R?

This is the code I ran in R against a Spark cluster, followed by the error it produced:
mydata<-spark_read_csv(spark_cluster,name = "rd_1",path = "IAF_Extracted_Data_Zipped.csv",header = F,delimiter = "|")
mydata %>% select(customer=V1,device_subscriber_id=V2,user_subscriber_id=V3,user_id=V4,location_id=V5)
Error in .f(.x[[i]], ...) : object 'V1' not found

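The dplyr renaming convention is new name = old name, so the name on the right-hand side must already exist in the table (check the actual names with colnames(mydata)).
If the existing columns are called customer, device_subscriber_id and so on, and you want them named V1 to V5, you are looking for the following: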
mydata %>%
select(V1 = customer,
V2 = device_subscriber_id,
V3 = user_subscriber_id,
V4 = user_id,
V5 = location_id)
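As a small illustration of the new name = old name convention, here is a sketch on an ordinary local data frame rather than a Spark table. The data frame local_data and its values are made up for this example; sparklyr tables go through the same dplyr verbs, but the real column names on the cluster should be checked first.

```r
library(dplyr)

# Hypothetical local stand-in for the Spark table (values are invented)
local_data <- data.frame(
  V1 = c("a", "b"),
  V2 = c(1, 2),
  V3 = c(3, 4),
  stringsAsFactors = FALSE
)

# Check which names actually exist before referring to them
names(local_data)

# rename() keeps every column and changes only the ones listed: new = old
renamed <- local_data %>%
  rename(customer = V1, device_subscriber_id = V2)

# select() keeps only the listed columns, renaming them on the way: new = old
selected <- local_data %>%
  select(customer = V1, user_subscriber_id = V3)
```

If a name on the right-hand side does not exist in the table, dplyr reports it as not found, which is the kind of error shown in the question.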

If you want specific names just provide a vector of names on read:
columns <- c("customer", "device_subscriber_id",
"user_subscriber_id", "user_id", "location_id")
spark_read_csv(
spark_cluster, name = "rd_1",path = "IAF_Extracted_Data_Zipped.csv",
header = FALSE, columns = columns, delimiter = "|"
)
The number of names should match the number of columns in the input.

Off the top of my head, you could try customer = mydata$V1 and similar for the other variables (assuming V1, ... are column names of mydata).

Related

regex match with fuzzyjoin / dplyr

I have two data frames that I want to join by the first column and to ignore the case:
df3<- data.frame("A" = c("XX28801","ZZ9"), "B" = c("one","two"),stringsAsFactors = FALSE)
df4<- data.frame("Z" = c("X2880","Zz9"),"C" = c("three", "four"), stringsAsFactors = FALSE)
What I want is this:
df5<- data.frame(A = c("XX28801","ZZ9"), B = c("one","two"), Z = c(NA,"Zz9"), C = c(NA, "four"))
but interestingly, I get this using the fuzzyjoin package:
join <- regex_left_join(df3,df4,by= c("A" = "Z"), ignore_case = TRUE)
It's good ZZ9 and Zz9 matched but I have no idea why XX28801 matched with X2880. The only similarity is the X2880 in XX28801.
I also don't want to uppercase/lowercase the values before joining as I want column A and column Z to retain their original values. Thanks.

Regex joins join on regular expressions: they search for the text from the right-hand table within the text of the left-hand table. Since "X2880" is found within "XX28801", this is considered a match.
To understand regex better, you might find it useful to explore some comparisons using grepl(pattern, text), which returns TRUE or FALSE depending on whether the pattern is found within text:
> grepl('X2880', 'XX28801', ignore.case = TRUE)
[1] TRUE
It seems like you want a match only when the two strings are identical apart from upper/lower case. For this I would recommend you create temporary columns to join on:
df3_w_lower = df3 %>%
mutate(A_for_join = tolower(A))
df4_w_lower = df4 %>%
mutate(Z_for_join = tolower(Z))
join = left_join(df3_w_lower, df4_w_lower, by = c("A_for_join" = "Z_for_join")) %>%
select(-A_for_join, - Z_for_join)
By using temporary columns for joining you preserve the capitalization in the original columns.
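To make the difference between a substring match and a whole-string match concrete, here is a minimal base R sketch using the data frames from the question. It is only an illustration: the anchored pattern and the merge()-based join are stand-ins chosen for this example, not part of fuzzyjoin, and the temporary column name key is hypothetical.

```r
df3 <- data.frame(A = c("XX28801", "ZZ9"), B = c("one", "two"),
                  stringsAsFactors = FALSE)
df4 <- data.frame(Z = c("X2880", "Zz9"), C = c("three", "four"),
                  stringsAsFactors = FALSE)

# Unanchored pattern: matches anywhere inside the text
substring_match <- grepl("X2880", "XX28801", ignore.case = TRUE)

# Anchored pattern: ^ and $ force the whole string to match
whole_match <- grepl("^X2880$", "XX28801", ignore.case = TRUE)

# Base R version of the temporary-key join: lowercase copies are used
# only for matching, so A and Z keep their original capitalization
df3$key <- tolower(df3$A)
df4$key <- tolower(df4$Z)
joined <- merge(df3, df4, by = "key", all.x = TRUE)
joined$key <- NULL
```

Here substring_match is TRUE while whole_match is FALSE, and joined keeps both rows of df3 with NA in Z and C for "XX28801".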

convert character column and then split it into multiple new boolean columns using r mutate

I am attempting to split out a flags column into multiple new columns in r using mutate_at and then separate functions. I have simplified and cleaned my solution as seen below, however I am getting an error that indicates that the entire column of data is being passed into my function rather than each row individually. Is this normal behaviour which just requires me to loop over each element of x inside my function? or am I calling the mutate_at function incorrectly?
example data:
dataVariable <- data.frame(c_flags = c(".q.q.q","y..i.o","0x5a",".lll.."))
functions:
dataVariable <- read_csv("...",
col_types = cols(
c_date = col_datetime(format = ""),
c_dbl = col_double(),
c_flags = col_character(),
c_class = col_factor(c("a", "b", "c")),
c_skip = col_skip()
))
funTranslateXForNewColumn <- function(x){
binary = ""
if(startsWith(x, "0x")){
binary=hex2bin(x)
} else {
binary = c(0,0,0,0,0,0)
splitFlag = strsplit(x, "")[[1]]
for(i in splitFlag){
flagVal = 1
if(i=="."){
flagVal = 0
}
binary=append(binary, flagVal)
}
}
return(paste(binary[4:12], collapse='' ))
}
mutate_at(dataVariable, vars(c_flags), funs(funTranslateXForNewColumn(.)))
separate(dataVariable, c_flags, c(NA, "flag_1","flag_2","flag_3","flag_4","flag_5","flag_6","flag_7","flag_8","flag_9"), sep="")
The error I am receiving is:
Warning messages:
1: Problem with `mutate()` input `c_flags`.
i the condition has length > 1 and only the first element will be used
After translating the string into an appropriate binary representation of the flags, I will then use the separate function to split it into new columns.

Similar to OP's logic but maybe shorter :
dataVariable$binFlags <- sapply(strsplit(dataVariable$c_flags, ''), function(x)
paste(as.integer(x != '.'), collapse = ''))
If you want to do this using dplyr we can implement the same logic as :
library(dplyr)
dataVariable %>%
mutate(binFlags = purrr::map_chr(strsplit(c_flags, ''),
~paste(as.integer(. != '.'), collapse = '')))
# c_flags binFlags
#1 .q.q.q 010101
#2 y..i.o 100101
#3 .lll.. 011100
mutate_at/across is used when you want to apply a function to multiple columns. Moreover, as far as I can see you are creating only one new binary column here, not multiple new columns as mentioned in your post.

I was able to get the outcome I desired by replacing the mutate_at function with:
dataVariable$binFlags <- mapply(funTranslateXForNewColumn, dataVariable$c_flags)
However I want to know how to use the mutate_at function correctly.
credit to: https://datascience.stackexchange.com/questions/41964/mutate-with-custom-function-in-r-does-not-work
The above link also includes the solution to get this function to work which is to vectorize the function:
v_funTranslateXForNewColumn <- Vectorize(funTranslateXForNewColumn)
mutate_at(dataVariable, vars(c_flags), funs(v_funTranslateXForNewColumn(.)))
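The underlying issue is that mutate passes the whole column to the function at once, while if() expects a single TRUE or FALSE. A minimal base R sketch of the two usual workarounds follows; flag_to_binary is a simplified, hypothetical stand-in for funTranslateXForNewColumn (it ignores the hex case and the padding), written only to show the pattern.

```r
# Scalar helper: works on ONE string at a time
flag_to_binary <- function(x) {
  chars <- strsplit(x, "")[[1]]
  paste(as.integer(chars != "."), collapse = "")
}

flags <- c(".q.q.q", "y..i.o", ".lll..")

# Option 1: apply the scalar function to each element explicitly
by_element <- vapply(flags, flag_to_binary, character(1), USE.NAMES = FALSE)

# Option 2: wrap it once so it accepts a whole vector (and so a whole column)
v_flag_to_binary <- Vectorize(flag_to_binary, USE.NAMES = FALSE)
by_wrapper <- v_flag_to_binary(flags)
```

Both calls return one string per input element, so either result can be assigned to a column directly.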

Error in fix.by (by.x, x): 'by' must define one or more columns as numbers, names or logical data

I have some files
file_analysis = read.xlsx(listfiles[1],sheetIndex = 1,header = FALSE)###
#View(file_analysis)
str(file_analysis)
file_comments = read.csv("C:/Users/adm/Downloads/comments.csv",sep=";")
#View(file_comments)
file_groups = read.xlsx(listfiles[6],sheetIndex = 1,header = FALSE) ####
#View(file_groups)
file_headeers = read.xlsx(listfiles[7],sheetIndex = 1,header = FALSE)
file_photos = read.csv("C:/Users/adm/Downloads/photos.csv",sep=";")
#View(file_photos)
file_profiles = read.xlsx(listfiles[12],sheetIndex = 1,header = FALSE) ####
#View(file_profiles)
file_profiles3 = read.xlsx(listfiles[13],sheetIndex = 1,header = FALSE)###
#View(file_profiles)
file_statistics = read.csv("C:/Users/adm/Downloads/statistics.csv",sep=";")
#View(file_statistics)
file_videos = read.csv("C:/Users/adm/Downloads/videos.csv",sep=";")
#View(file_videos)
I need to merge them into one dataset.
The simple way:
n=merge(file_comments,file_groups,file_photos ,file_profiles,
file_profiles3,file_statistics,
file_videos, by ="owner_id")
but it returns this error:
Error in fix.by (by.x, x): 'by' must define one or more columns as numbers, names or logical data
These related questions did not help me, and I don't know why:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column mergedata <- merge (dataset1, dataset2, by.x="personalid")
Merging data - Error in fix.by(by.x, x)
owner_id is numeric
example
258894746
3389571
3389572
3389573
3389574
118850
What's wrong?
I need to join all the files at once.

merge does not accept more than two data frames. You should apply it repeatedly, pair by pair, with Reduce or the purrr::reduce function.
Base R
Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "owner_id"),
list(file_comments,file_groups,file_photos ,file_profiles,
file_profiles3,file_statistics,
file_videos)
)
tidyverse syntax
library(dplyr)
library(purrr)
list(file_comments,file_groups,file_photos ,file_profiles,
file_profiles3,file_statistics,
file_videos) %>% reduce(inner_join, by = "owner_id")
By the way, if you prefer left join rather than inner join (the one you intended to use):
add all.x = TRUE argument in merge
use left_join rather than inner_join in tidyverse solution
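A likely reading of the error is that, with by given by name, the third data frame in the call is matched positionally to merge's by.x argument, which is why fix.by(by.x, x) complains. The sketch below shows the Reduce pattern on three small made-up data frames (comments, photos and videos and their columns are hypothetical), for both the inner and the left join.

```r
# Toy tables sharing an owner_id key (all values invented for illustration)
comments <- data.frame(owner_id = c(1, 2, 3), n_comments = c(10, 20, 30))
photos   <- data.frame(owner_id = c(1, 2),    n_photos   = c(5, 6))
videos   <- data.frame(owner_id = c(2, 3),    n_videos   = c(7, 8))

tables <- list(comments, photos, videos)

# Inner join: merge the running result with the next table, two at a time
inner <- Reduce(function(x, y) merge(x, y, by = "owner_id"), tables)

# Left join: keep every row of the running result (all.x = TRUE)
left <- Reduce(function(x, y) merge(x, y, by = "owner_id", all.x = TRUE),
               tables)
```

With these inputs, inner keeps only the owner_id present in all three tables, while left keeps all three owners from comments and fills the gaps with NA.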

How to run a function (many times) that changes variable (tibble) in global env

I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create tibble that holds a "Full Name" (of a person, that may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical Male and Female names.
Below I present a minimum working example.
My problem: I can call get_name() multiple times (in a for loop of 10,000 iterations!!) and get the right answer. But I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector... which makes it unusable.
My doubts: I know I have some (very few... right!!) issues, like the if statement, which is evaluated every time (which is redundant), but I can't find another way to do it. Any suggestions?
Any other suggestions about code structure are also welcome.
Thank you very much in advance for your help.
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()

If you pass a larger number than 1 to sample this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style: unlike your function, the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but it is arguably a mistake in the language specification: these are not strings, they’re variable names, and making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
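For readers without the tidyverse loaded, the same vectorised idea can be sketched in base R. This is an illustrative variant, not the answer's code: generate_names_base and the output column names full_name and gender are hypothetical, and mapply plays the role of Map plus unlist.

```r
unit_names <- list(
  Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
  Men   = c("male1", "male2", "male3", "male4", "male5")
)

generate_names_base <- function(n, unit_names) {
  # Draw all name lengths and genders in one go instead of looping
  name_length <- sample(2:4, n, replace = TRUE)
  genders <- sample(names(unit_names), n, replace = TRUE)
  # For each person, sample from the matching pool and paste into one string
  full <- mapply(
    function(pool, k) paste(sample(pool, k, replace = TRUE), collapse = " "),
    unit_names[genders], name_length,
    USE.NAMES = FALSE
  )
  data.frame(full_name = full, gender = genders, stringsAsFactors = FALSE)
}

set.seed(12345)
res <- generate_names_base(1000, unit_names)

# Sanity check: every full name has between 2 and 4 unit names
word_counts <- lengths(strsplit(res$full_name, " "))
range(word_counts)
```

Because all random draws happen up front, the function needs no global state and no loop over 10,000 calls.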

I am not 100% sure what you are trying to get, but if I got it right... did you try mutate from dplyr? For example:
result= mutate(data.frame,
concated_column = paste(column1, column2, column3, column4, sep = '_'))

With a LITTLE help from Konrad Rudolph, I arrived at the following elegant (and vectorized ... and fast) solution that I was looking for. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
library(dplyr) # provides tribble(), tibble() and %>%
library(purrr) # provides map2()
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
names = map2(name_list[genders], name_length, sample) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)

Using mutate in R to rename items in a column

EDIT
I am trying to name a column and rename all items within the column of a dataset:
dataSet <- read.csv(url) %>%
rename("newColumn1" = V1) %>%
mutate(newColumn1 = recode(newColumn1, "oldEntryX" = "newEntryX") %>%
select(dataSet, newColumn1)
And I get this error:
Error in recode(newColumn1, oldEntryX = "newEntryX" :
object 'newColumn1' not found
What am I missing?
The code runs correctly up through the rename function and displays the renamed column correctly, but soon as I include mutate it throws an error.
I have no problem sharing the real code but wanted to generalize it for the crowd.
source info was from https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data

In the mutate step, you don't need quotes for column names on the lhs of =. Also, there are a couple of case mismatches.
Assuming the dataset is read correctly, we can do:
df1 %>%
rename(newColumn1 = V1, newColumn2 = V2) %>%
mutate(newColumn1 = recode(newColumn1, oldEntryX = "newEntryX"),
newColumn2 = recode(newColumn2, oldEntryY = "newEntryY"))
Note also that in the OP's code "newColumn1 is missing its closing quote.
data
set.seed(24)
df1 <- data.frame(V1 = sample(c("oldEntryX", "x", "y"), 10, replace = TRUE),
V2 = sample(c("oldEntryY", "x", "y"), 10, replace = TRUE), stringsAsFactors= FALSE)

You can do this with a few simple lines of R:
How to read a csv file
Syntax: read.csv("filename.csv")
By default, this command uses the first row as the header. If the file has no header row, write instead
data <- read.csv("datafile.csv", header=FALSE)
How to rename the header/Column name:
names(data) <- c("Column1", "Column2", "Column3")
Now your headers are replaced by Column1, Column2 and Column3
Now, to change the data in Column1, assign the replacement values:
data$Column1 <- c(write down set of values with which you want to replace)
To see the output, type
data
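The base R steps above can be sketched end to end on a small made-up data frame. The values and the names newColumn1, newColumn2, oldEntryX and newEntryX follow the placeholders used in this thread; replacing entries by logical indexing is one possible approach chosen for this example, not the only one.

```r
# Toy data standing in for the file that was read in (values are invented)
df1 <- data.frame(
  V1 = c("oldEntryX", "x", "y", "oldEntryX"),
  V2 = c("oldEntryY", "x", "oldEntryY", "y"),
  stringsAsFactors = FALSE
)

# Rename individual columns by matching on the old name
names(df1)[names(df1) == "V1"] <- "newColumn1"
names(df1)[names(df1) == "V2"] <- "newColumn2"

# Replace only the matching entries; everything else stays untouched
df1$newColumn1[df1$newColumn1 == "oldEntryX"] <- "newEntryX"
df1$newColumn2[df1$newColumn2 == "oldEntryY"] <- "newEntryY"

df1
```

Matching on the old name, rather than overwriting names(df1) wholesale, avoids having to list every column when only a few are renamed.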
