Rename multiple columns at once in SparkR DataFrame


How can I rename multiple columns in a SparkR DataFrame at one time instead of calling withColumnRenamed() multiple times? For example, let's say I want to rename the columns in the DataFrame below to name and birthdays. How would I do so without calling withColumnRenamed() twice?
team <- data.frame(name = c("Thomas", "Bill", "George", "Randall"),
surname = c("Johnson", "Clark", "Williams", "Yosimite"),
dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
team <- createDataFrame(team)
team <- withColumnRenamed(team, 'surname', 'name')
team <- withColumnRenamed(team, 'dates', 'birthdays')

Standard R methods apply here - you can simply reassign colnames:
colnames(team) <- c("name", "name", "birthdays")
team
SparkDataFrame[name:string, name:string, birthdays:string]
If you know the order of the columns, you can skip the full list and replace only the names you want to change:
colnames(team)[colnames(team) %in% c("surname", "dates")] <- c("name", "birthdays")
You'll probably want to avoid duplicate names though.
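The %in% form above relies on the new names being listed in the same order as the matching columns. A minimal sketch of an order-independent variant, shown on a plain local data.frame rather than a SparkDataFrame, looks up each column in a named vector. The mapping renames and the target name last_name are hypothetical and were chosen only to avoid the duplicate-name problem; assigning the resulting vector to colnames of a SparkDataFrame is assumed to work as in the answer above and is not tested here.

```r
# Hypothetical mapping: names are the old column names, values the new ones
renames <- c(dates = "birthdays", surname = "last_name")

# Local stand-in for the example data (a plain data.frame, not a SparkDataFrame)
team_local <- data.frame(name = c("Thomas", "Bill"),
                         surname = c("Johnson", "Clark"),
                         dates = c("2017-01-05", "2017-02-23"),
                         stringsAsFactors = FALSE)

cols <- colnames(team_local)
hit <- cols %in% names(renames)       # which columns have an entry in the mapping
cols[hit] <- renames[cols[hit]]       # look each one up by name, not by position
colnames(team_local) <- unname(cols)
```

Because the lookup is by name, the order of entries in renames does not matter, and columns without an entry keep their original names.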

Related

How to not include row name when using rowMeans for dataframe in R

temp = as.data.frame(t(marks))
rownames(temp) = c("John", "Mary", "Mark", "June", "Claire", "Anthony")
names(temp) = c("Module1", "Module2","Module3","Module4","Module5")
rowMeans(temp["June",])
Here's the data frame
The output of rowMeans includes the row name. Is there any way to leave it out? I only want the value.
The current output is:
June
87.2

We can use unname:
unname(rowMeans(temp["June",]))
Another option is as.vector, which strips off the names attribute as well:
as.vector(rowMeans(temp["June",]))
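Since the original data frame is not shown, here is a small sketch with made-up marks (the values and the two-module layout are assumptions for illustration only) showing that both approaches return the bare number:

```r
# Illustrative data only; the real marks are not given in the question
temp <- data.frame(Module1 = c(80, 90),
                   Module2 = c(70, 85),
                   row.names = c("June", "Mary"))

named <- rowMeans(temp["June", ])   # named numeric vector, name is "June"
bare1 <- unname(named)              # drops the names attribute
bare2 <- as.vector(named)           # also drops it
```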

Renaming Data Frame Column Names using Partial Matching

If I have the code at the bottom of the post, how can I replace the column names of df1 with the second column of df2 using partial matching of df2's first column? The output should look like df3. The entirety of my data frame is filled with many other names besides .length (i.e. CS1.1.width, CS2.12.height, etc), but the CS#.#. always remains in the name.
I then need to remove the ".length" from the colnames.
I have tried using pmatch below for the first part of the question, but the output is not correct.
names(df1) <- df2$new[pmatch(names(df1), df2$partial_atch)]
How would I go about this? Thanks.
old <- c("CS1.1.length", "CS1.7.length", "CS1.10.length", "CS1.12.length", "CS2.4.length", "CS2.6.length", "CS2.9.length", "CS2.11.length", "CS1.1.height")
df1 <- data.frame()
for (k in old) df1[[k]] <- as.character()
new <- c("Bob", "Alex", "Gary", "Taylor", "Tom", "John", "Pat", "Mary")
partial_match <- c("CS1.1", "CS1.7", "CS1.10", "CS1.12", "CS2.4", "CS2.6", "CS2.9", "CS2.11")
df2 <- data.frame(Partial_Match = partial_match, Name = new)
new1 <- c("Bob.length", "Alex.length", "Gary.length", "Taylor.length", "Tom.length", "John.length", "Pat.length", "Mary.length", "Bob.height")
df3 <- data.frame()
for (k in new1) df3[[k]] <- as.character()
Edit: The number of columns in df1 is greater than the number of elements in partial_match, so added an additional column in df1 as example.

Here's an option with stri_replace_all_regex from the stringi package:
This works because, with vectorize_all = FALSE, each element of pattern is replaced in turn by the matching element of replacement across all of names(df1).
We need to paste on the trailing . because this prevents CS1.1 from also matching longer names such as CS1.10 and CS1.12.
library(stringi)
stri_replace_all_regex(names(df1),
pattern = paste0(as.character(df2$Partial_Match),"\\."),
replacement = paste0(as.character(df2$Name),"\\."),
vectorize_all = FALSE)
#[1] "Bob.length" "Alex.length" "Gary.length" "Taylor.length" "Tom.length" "John.length" "Pat.length"
#[8] "Mary.length"
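To illustrate why the trailing dot matters, here is a minimal base R sketch of the same idea without stringi. It swaps in startsWith and substring for the regex replacement, and the shortened vectors old_names, prefixes and people are assumptions made for the example. The last line shows one way to drop the ".length" suffix, which the question also asks for.

```r
old_names <- c("CS1.1.length", "CS1.10.length", "CS1.12.height")
prefixes  <- c("CS1.1", "CS1.10", "CS1.12")
people    <- c("Bob", "Gary", "Taylor")

new_names <- old_names
for (i in seq_along(prefixes)) {
  key <- paste0(prefixes[i], ".")        # trailing dot: "CS1.1." cannot match "CS1.10."
  hit <- startsWith(new_names, key)
  new_names[hit] <- paste0(people[i], ".",
                           substring(new_names[hit], nchar(key) + 1))
}

# Remove a trailing ".length" only; other suffixes such as ".height" are kept
stripped <- sub("\\.length$", "", new_names)
```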

Subset dataframe with keywords

I have a dataframe consisting of twitter data (ID number, follower_count, clean_text). I am interested in dividing my dataframe into two subsets: one where keywords are present, and one where keywords are not present.
For example, I have the keywords stored as a value:
KeyWords <- c("abandon*", "abuse*", "agitat*" ,"attack*", "bad", "brutal*",
"care", "caring", "cheat*", "compassion*", "cruel*", "damag*",
"damn*", "destroy*", "devil*", "devot*", "disgust*", "envy*",
"evil*", "faith*","fault*", "fight*", "forbid*", "good", "goodness",
"greed*", "gross*", "hate", "heaven*", "hell", "hero*", "honest*",
"honor*", "hurt*","ideal*", "immoral*", "kill*", "liar*","loyal*",
"murder*", "offend*", "pain", "peace*","protest", "punish*","rebel*",
"respect", "revenge*", "ruin*", "safe*", "save", "secur*", "shame*",
"sin", "sinister", "sins", "slut*", "spite*", "steal*", "victim*",
"vile", "virtue*", "war", "warring", "wars", "whore*", "wicked*",
"wrong*", "benefit*", "harm*", "suffer*","value*") %>% paste0(collapse="|")
And I have made a subset (Data2) of my original dataframe (Data1) where Data2 consists of only the observations in Data1 where one or more of the keywords are present in the clean_text column. Like so:
Data2 <- Data1[with(Data1, grepl(paste0("\\b(?:",paste(KeyWords, collapse="|"),")\\b"), clean_text)),]
Now, I want to make Data3, containing only the observations in Data1 where none of the keywords are present in the clean_text column. Is there a way to do the inverse of my keyword subsetting above? Or, can I subtract my Data2 from Data1 to get my new subset, Data3?

The "inverse" operator in R is !, the logical negation: it flips TRUE to FALSE and vice versa. So, with your example, what you're looking for is:
Data3 <- Data1[!with(Data1, grepl(paste0("\\b(?:",paste(KeyWords, collapse="|"),")\\b"), clean_text)),]
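As a small sketch of the same idea, the logical mask can be computed once and reused for both subsets, which guarantees that the two pieces partition the original data. The toy data frame tweets and the two-keyword pattern are assumptions for illustration; perl = TRUE is used here so that \\b and (?:...) are handled by PCRE.

```r
# Toy data standing in for Data1
tweets <- data.frame(id = 1:4,
                     clean_text = c("they attack at dawn", "nice weather today",
                                    "a bad idea", "badminton practice"),
                     stringsAsFactors = FALSE)

pattern <- "\\b(?:attack|bad)\\b"            # whole-word match on two sample keywords
has_kw  <- grepl(pattern, tweets$clean_text, perl = TRUE)

with_kw    <- tweets[has_kw, ]               # rows containing a keyword
without_kw <- tweets[!has_kw, ]              # the complement, via logical negation
```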

Unlist column to create unique row in dataframe

I am faced with the following R transformation issue.
I have the following dataframe:
test_df <- structure(list(word = c("list of XYZ schools",
"list of basketball", "list of usa"), results = c("58", "151", "29"), key_list = structure(list(`coRq,coG,coQ,co7E,coV98` = c("coRq", "coG", "coQ", "co7E", "coV98"), `coV98,coUD,coHF,cobK,con7` = c("coV98","coUD", "coHF", "cobK", "con7"), `coV98,coX7,couC,coD3,copW` = c("coV98", "coX7", "couC", "coD3", "copW")), .Names = c("coRq,coG,coQ,co7E,coV98", "coV98,coUD,coHF,cobK,con7", "coV98,coX7,couC,coD3,copW"))), .Names = c("word", "results", "key_list"), row.names = c(116L, 150L, 277L), class = "data.frame")
In short there are three columns, unique on "word" and then a corresponding "key_list" that has a list of keys comma separated. I am interested in creating a new data frame where each key is unique and the word information is duplicated as well as the result information.
So a dataframe that looks as follows:
key word results
coV98 "list of XYZ schools" 58
coRq "list of XYZ schools" 58
coV98 "list of basketball" 151
coV98 "list of usa" 29
And so on for all the keys. In other words, I would like to unlist the keys and then reshape the result into a dataframe in which the word and the other columns are repeated for each key.
I have tried a bunch of the following:
Created a unique list of keys and then attempted to grep for each of those keys in the column and loop through to create a new smaller dataframe and then rbind those together, the resulting dataframe however does not contain the key column:
keys <- as.data.frame(table(unname(unlist(test_df$key_list))))
ttt <- lapply(keys, function(xx){
idx <- grep(xx, test_df$key_list)
df <- all_data_sub[idx,]})
final_df <- do.call(rbind, ttt)
I have also played around with unlisting and reshaping, but I am not getting the right combination.
Any advice would be great!
thanks

Maybe we can use listCol_l from splitstackshape:
library(splitstackshape)
listCol_l(test_df, 'key_list')[]

In case a base R solution is helpful for someone:
do.call(rbind, lapply(seq_along(test_df$key_list), function(i) {
merge(test_df$key_list[[i]], test_df[i,-3], by=NULL)
}))
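Another base R sketch, assuming a smaller stand-in for test_df with made-up keys, repeats each row by the length of its key list and unlists the keys alongside:

```r
# Small stand-in for test_df; keys k1..k4 are placeholders
df <- data.frame(word = c("list of a", "list of b"),
                 results = c("58", "151"),
                 stringsAsFactors = FALSE)
df$key_list <- list(c("k1", "k2"), c("k2", "k3", "k4"))

n <- lengths(df$key_list)                    # number of keys per row
long <- data.frame(key = unlist(df$key_list),
                   word = rep(df$word, n),   # repeat each word once per key
                   results = rep(df$results, n),
                   stringsAsFactors = FALSE)
```

This avoids building one small data frame per row and binding them, which may matter when the input has many rows.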

Seeking fast or automated method for naming many new data.table columns in R

I have a large dataset, 3000x400. I need to create new columns that are means of the existing columns, grouped by a variable Constituency. I have a list of new column names that I want to use to name the new columns, below called newNames. But I can only figure out how to name columns when I directly type the desired new name.
What I currently do:
set.seed(1)
dataTest = data.table(turnout_avg = rnorm(20), urban_avg = rnorm(20,5,2), Constituency = c("A","B","C","D"), key = "Constituency")
oldColumnNames = c( "turnout_avg" , "urban_avg")
newNames = c( "turnout" , "urban")
# Here's my problem, naming these new columns
comm_means_by_district = cbind(
dataTest[,list(Const_turnout = mean(na.omit(get(oldColumnNames[[1]])))), by= Constituency],
dataTest[,list(Const_urban = mean(na.omit(get(oldColumnNames[[2]])))),by= Constituency])
In reality, I want to create much more than two new columns. So I cannot feasibly type Const_turnout, Const_urban, etc. for all new columns.
I've tried two ideas, but neither works:
1.
dataTest[,list(paste("district", newNames[1], sep="_") = mean(na.omit(get(refColNames[[1]])))), by= Constituency]
Or 2.
dataTest[,list(paste(oldColumnNames[1], "constMean", sep="_") = mean(na.omit(get(refColNames[[1]])))), by= Constituency]

First get the mean of all the columns in one go:
DT <- dataTest[,lapply(.SD,function(x) mean(na.omit(x))), by= Constituency]
Then change the column names afterwards (vector_of_newnames needs one name per column of DT, including Constituency):
setnames(DT,colnames(DT),vector_of_newnames)

Why is it important to change the names in the same line where you apply the function? I would just first calculate the constituency-wise means and set the column names afterwards. Here's what this would look like:
dt <- dataTest[, lapply(oldColumnNames, function(x) mean(na.omit(get(x)))),
by=Constituency]
setnames(dt, c("Constituency", paste("Const", newNames, sep="_")))
dt
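The two answers above can be combined into one sketch that restricts the means to the listed columns with .SDcols and then renames only those columns by their old names. The names dt_in, old_cols, new_cols and means are assumptions for the example, and the Constituency column is built with rep() so that its length matches the other columns explicitly.

```r
library(data.table)

set.seed(1)
dt_in <- data.table(turnout_avg = rnorm(20),
                    urban_avg = rnorm(20, 5, 2),
                    Constituency = rep(c("A", "B", "C", "D"), 5))

old_cols <- c("turnout_avg", "urban_avg")
new_cols <- paste("Const", c("turnout", "urban"), sep = "_")

# Group-wise means of only the listed columns
means <- dt_in[, lapply(.SD, mean, na.rm = TRUE),
               by = Constituency, .SDcols = old_cols]

# Rename by old name, so the grouping column is left untouched
setnames(means, old_cols, new_cols)
```

Passing both the old and the new names to setnames means the rename does not depend on column positions, which may be convenient when the real table has hundreds of columns.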