How to drop all NA columns in a SparkDataFrame with SparkR?


Once again, I'm facing a problem that I can't translate to SparkR.
I have a SparkDataFrame in which some columns contain only NAs, and I want to delete all of these columns.
I discovered SparkR recently and am far from understanding how it all works, but it's frustrating to be stuck on something that shouldn't be so complicated...
Here is a reprex and the way I do it in R:
library(SparkR)
library(data.table)
df <- data.frame(V1 = base::sample(1:10,5), V2 = base::rep(NA,5), V3 = base::sample(1:10,5), V4 = base::rep(NA,5), V5 = base::rep(NA,5), X = runif(n = 5, min = 0, max = 5))
sdf <- createDataFrame(df)
dt <- setDT(df)
na.lst <- sapply(dt, function(x) all(is.na(x)))
dt[, which(na.lst) := NULL]
Thanks !

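You can consider the following approach. It checks each column in turn: after selecting a single column, dropna leaves zero rows exactly when that column contains only NAs.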
library(SparkR)
df <- data.frame(V1 = base::sample(1 : 10,5),
V2 = base::rep(NA,5),
V3 = base::sample(1 : 10,5),
V4 = base::rep(NA,5),
V5 = base::rep(NA,5),
X = runif(n = 5, min = 0, max = 5))
sdf <- createDataFrame(df)
col_Names <- colnames(sdf)
nb_Col_Names <- length(col_Names)
vec_Bool <- rep(FALSE, nb_Col_Names)
for(i in 1 : nb_Col_Names)
{
dim_Temp <- dim(dropna(select(sdf, col = col_Names[i]), how = "all"))
if(dim_Temp[1] != 0) vec_Bool[i] <- TRUE
}
col <- col_Names[vec_Bool]
newdf <- select(sdf, col = col)
as.data.frame(newdf)
V1 V3 X
1 6 1 2.286716
2 10 3 3.532843
3 2 9 2.030851
4 8 6 3.304420
5 4 10 1.596272
See Remove columns with only NA values with SparkR

Related

How to add multiple calculated columns to a SparkDataFrame using SparkR?

Now I'm stuck on a rather basic case, but I can't find a clever solution with SparkR ...
From N columns in my SparkDataFrame, I need to create N new calculated columns.
df <- data.frame(V1 = base::sample(1:10,5),
V2 = base::sample(1:10,5),
V3 = base::sample(1:10,5),
V4 = base::sample(1:10,5),
V5 = base::sample(1:10,5),
X = runif(n = 5, min = 0, max = 5))
sdf <- createDataFrame(df)
sdf <- withColumn(sdf, "V1_X", column("X") / column("V1"))
sdf <- withColumn(sdf, "V2_X", column("X") / column("V2"))
sdf <- withColumn(sdf, "V3_X", column("X") / column("V3"))
sdf <- withColumn(sdf, "V4_X", column("X") / column("V4"))
sdf <- withColumn(sdf, "V5_X", column("X") / column("V5"))
Basically, I want to apply a function to my vector/list of column names.
This is easy in R. In SparkR, I can lapply a function, but doing so modifies the original columns instead of adding new ones. What am I missing?
Thanks !

Maybe you can consider the following approach
library(SparkR)
df <- data.frame(V1 = base::sample(1:10,5),
V2 = base::sample(1:10,5),
V3 = base::sample(1:10,5),
V4 = base::sample(1:10,5),
V5 = base::sample(1:10,5),
X = runif(n = 5, min = 0, max = 5))
sdf <- createDataFrame(df)
col_Names <- colnames(sdf)
nb_Col_Ratio <- length(col_Names) - 1
for(i in 1 : nb_Col_Ratio)
{
new_Col_Name <- paste0(col_Names[i], "_X")
sdf <- withColumn(sdf, new_Col_Name, column("X") / column(col_Names[i]))
}
print(as.data.frame(sdf))

I usually prefer doing this kind of task with a single SparkR::select call instead of looping over multiple withColumn statements. The same logic can be used in Python with a list comprehension. See the modified version of your code below:
df <- data.frame(V1 = base::sample(1:10,5),
V2 = base::sample(1:10,5),
V3 = base::sample(1:10,5),
V4 = base::sample(1:10,5),
V5 = base::sample(1:10,5),
X = runif(n = 5, min = 0, max = 5))
sdf <- SparkR::createDataFrame(df)
# sdf <- SparkR::withColumn(sdf, "V1_X", SparkR::column("X") / SparkR::column("V1"))
# sdf <- SparkR::withColumn(sdf, "V2_X", SparkR::column("X") / SparkR::column("V2"))
# sdf <- SparkR::withColumn(sdf, "V3_X", SparkR::column("X") / SparkR::column("V3"))
# sdf <- SparkR::withColumn(sdf, "V4_X", SparkR::column("X") / SparkR::column("V4"))
# sdf <- SparkR::withColumn(sdf, "V5_X", SparkR::column("X") / SparkR::column("V5"))
loop_on_cols <- SparkR::colnames(sdf)[SparkR::colnames(sdf)!="X"]
sdf2 <- SparkR::select(
sdf,
c(
SparkR::colnames(sdf),
lapply(
loop_on_cols,
function(c) {
SparkR::alias(SparkR::column("X")/SparkR::column(c), paste0(c,"_X"))
}
)
)
)
SparkR::head(sdf2)

Can I "automate" uniting columns that are V1:V148?

I have a dataset that has 148 columns. I need to combine them in groups of 4.
For example, V1, V2, V3, V4 become X1:
V1 <- c(0, 3)
V2 <- c("F", "F")
V3 <- c(2, 4)
V4 <- c("A", "C")
X1
0F2A
3F4C
I know I can use
```{r}
new_data_4 <-new%>%
unite(V1:V4)%>%
unite(V5:V8)%>%
unite(V9:V12)%>%
unite(V13:V16)
```
with great success, but I would like to turn this into a function. My wish is that it counts the number of columns and unites them automatically, without hardcoding the column numbers, because I have more files to go over. I have looked around Stack Overflow and found lots of examples for specific problems that don't really match what I have.
I have tried:
```{r}
unite_columns <-function(x){
united_cols <-tidyr::unite(x, seq_along(1, ncol(x), 4), seq_along(4, ncol(x), 4))
return(united_cols)
}
```
and
```{r}
unite_columns <-function(x){
united_cols <-unite(x, seq(1, ncol(x), 4), seq(4, ncol(x), 4))
return(united_cols)
}
```
I was thinking I could use a similar tactic that is used to merge strings but it did not work.
Any help would be greatly appreciated. TIA

You can use split.default to split the columns into groups of 4 and paste the values row-wise using do.call.
result <- data.frame(sapply(split.default(new, ceiling(seq_along(new)/4)),
function(x) do.call(paste0, x)))
# X1 X2
#1 0F2A 8A4R
#2 3F4C 9B5K
data
new <- data.frame(V1 = c(0,3), V2 = c("F","F"), V3 = c(2,4), V4 = c("A","C"),
V5 = c(8, 9), V6 = c("A", "B"), V7 = c(4, 5), V8 = c("R", "K"))
new
# V1 V2 V3 V4 V5 V6 V7 V8
#1 0 F 2 A 8 A 4 R
#2 3 F 4 C 9 B 5 K
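Since the question asks for a reusable function, the split.default idea above could be wrapped up roughly as follows. This is only a sketch: the function name `unite_every` and its `size` and `prefix` arguments are made up for illustration, and it assumes the columns are already in the order in which they should be combined.

```r
# Hypothetical helper: paste every `size` consecutive columns into one column.
unite_every <- function(df, size = 4, prefix = "X") {
  # Group id for each column position: 1,1,1,1,2,2,2,2,...
  groups <- ceiling(seq_along(df) / size)
  # Split the columns by group and paste each block row-wise.
  pieces <- lapply(split.default(df, groups), function(block) {
    do.call(paste0, unname(as.list(block)))
  })
  out <- as.data.frame(pieces, stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, seq_along(out))
  out
}

new <- data.frame(V1 = c(0, 3), V2 = c("F", "F"), V3 = c(2, 4), V4 = c("A", "C"),
                  V5 = c(8, 9), V6 = c("A", "B"), V7 = c(4, 5), V8 = c("R", "K"))
res <- unite_every(new)
print(res)
```

If the number of columns is not a multiple of `size`, the last group simply contains the leftover columns.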

Using Reduce.
# (4*f - 3):(4*f) selects columns 1-4, 5-8, 9-12, ...
sapply(1:(ncol(new)/4), function(f) Reduce(paste0, new[(4*f - 3):(4*f)]))
#      [,1]   [,2]   [,3]
# [1,] "0F2A" "8A4R" "1C9D"
# [2,] "3F4C" "9B5K" "2D8S"
If you want a data frame:
as.data.frame(sapply(1:(ncol(new)/4), function(f) Reduce(paste0, new[(4*f - 3):(4*f)])))
#     V1   V2   V3
# 1 0F2A 8A4R 1C9D
# 2 3F4C 9B5K 2D8S
Data
new <- structure(list(V1 = c(0, 3), V2 = c("F", "F"), V3 = c(2, 4),
V4 = c("A", "C"), V5 = c(8, 9), V6 = c("A", "B"), V7 = c(4,
5), V8 = c("R", "K"), V9 = c(1, 2), V10 = c("C", "D"), V11 = c(9,
8), V12 = c("D", "S")), class = "data.frame", row.names = c(NA,
-2L))
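One detail that is easy to get wrong when indexing blocks of four columns: in R the `:` operator binds more tightly than `*`, so an expression like `1:4 * f` means `(1:4) * f` (every f-th position), not the f-th block of four. A small check, with `f`, `strided` and `block` as illustrative names only:

```r
f <- 2
strided <- 1:4 * f              # same as (1:4) * f
block   <- (4 * f - 3):(4 * f)  # the f-th consecutive block of four
print(strided)  # 2 4 6 8
print(block)    # 5 6 7 8
```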

Another purrr and tidyr option could be:
imap_dfc(.x = split.default(df, ceiling(1:ncol(df)/4)),
~ .x %>%
unite(col = !!paste0("X", .y), everything(), sep = ""))
X1 X2
1 0F2A 8A4R
2 3F4C 9B5K

You can also do this with the tidyverse only, using purrr::map2_dfc.
The first argument is a sequence of length n/4 (you could use ncol(new) instead of storing n separately),
and the second argument is the vector of names for the columns to be generated.
At each iteration, the function picks out four columns by integer division of the column positions,
unites them under the name given by the second argument,
and then selects only that new column.
The columns produced at each iteration are bound together column-wise, which is why map2_dfc is used.
library(tidyverse)
#say n is 148
n <- 148
map2_dfc(seq_len(n/4), paste0("X", seq_len(n/4)), ~new %>%
unite(!!.y,
seq_along(new)[(3 + seq_along(new)) %/% 4 == .x],
sep = "") %>% select(all_of(.y))
)
Checking it on the data provided by @Ronak:
n <- 8
map2_dfc(seq_len(n/4), paste0("X", seq_len(n/4)), ~new %>%
unite(!!.y,
seq_along(new)[(3 + seq_along(new)) %/% 4 == .x],
sep = "") %>% select(all_of(.y))
)
X1 X2
1 0F2A 8A4R
2 3F4C 9B5K
Or on the data provided by @jay.sf:
n <- 12
map2_dfc(seq_len(n/4), paste0("X", seq_len(n/4)), ~new %>%
unite(!!.y,
seq_along(new)[(3 + seq_along(new)) %/% 4 == .x],
sep = "") %>% select(all_of(.y))
)
X1 X2 X3
1 0F2A 8A4R 1C9D
2 3F4C 9B5K 2D8S

Correlations between dataframe and list of dataframes in R

I want to calculate correlations between a dataframe and a list of dataframes. Here is my sample:
library(lubridate)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(2,20, length = 10)
v3 = seq(-2,7, length = 10)
v4 = seq(-6,3, length = 10)
df1 = data.frame(Date = v1, Tmax = v2, Tmean = v3, Tmin = v4)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(3,21, length = 10)
v3 = seq(-3,8, length = 10)
v4 = seq(-7,4, length = 10)
abc = data.frame(Date = v1, ABC_Tmax = v2, ABC_Tmean = v3, ABC_Tmin = v4)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(4,22, length = 10)
v3 = seq(-4,9, length = 10)
v4 = seq(-8,5, length = 10)
def = data.frame(Date = v1, DEF_Tmax = v2, DEF_Tmean = v3, DEF_Tmin = v4)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(2,20, length = 10)
v3 = seq(-2,8, length = 10)
v4 = seq(-6,3, length = 10)
ghi = data.frame(Date = v1, GHI_Tmax = v2, GHI_Tmean = v3, GHI_Tmin = v4)
df2 <-list(abc, def, ghi)
names(df2) = c("ABC", "DEF", "GHI")
I want all correlation coefficients between df1 and df2, but only column-wise, between matching variables.
For example:
df1$Tmax and all df2*Tmax columns
df1$Tmean and all df2*Tmean columns
df1$Tmin and all df2*Tmin columns
I know that I can access all Tmax columns like that:
lapply(df2, "[[", 2)
I know how to calculate the correlation between two single columns:
cor.test(df1$Tmax, df2$ABC$ABC_Tmax, method = "spearman")
But how can I do it for all columns at once? I tried this, which is not working:
cor.test(df1$Tmax, lapply(df2, "[[", 2), method = "spearman")
Any ideas?

You could use lapply in combination with mapply to apply cor.test and extract a specific value from the test. For example, to get p.value and estimate we can do
lapply(2:4, function(i) mapply(function(x, y) {
a <- cor.test(x, y, method = "spearman")
c(setNames(a$p.value, "pvalue"), a$estimate)
}, lapply(df2, "[[", i), df1[i]))
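If a single table of results is easier to work with than a list of matrices, the same tests could be collected into a data frame, roughly as sketched below. This rebuilds the question's sample data with base R dates, and it assumes each list element names its columns "<source>_<variable>" (as in the sample); `ramp`, `pairs` and `cor_table` are names chosen only for this illustration.

```r
# Sample data from the question, rebuilt without lubridate.
dates <- seq(as.Date("2000-05-01"), as.Date("2000-05-10"), by = "day")
ramp <- function(from, to) seq(from, to, length.out = 10)

df1 <- data.frame(Date = dates, Tmax = ramp(2, 20), Tmean = ramp(-2, 7), Tmin = ramp(-6, 3))
df2 <- list(
  ABC = data.frame(Date = dates, ABC_Tmax = ramp(3, 21), ABC_Tmean = ramp(-3, 8), ABC_Tmin = ramp(-7, 4)),
  DEF = data.frame(Date = dates, DEF_Tmax = ramp(4, 22), DEF_Tmean = ramp(-4, 9), DEF_Tmin = ramp(-8, 5)),
  GHI = data.frame(Date = dates, GHI_Tmax = ramp(2, 20), GHI_Tmean = ramp(-2, 8), GHI_Tmin = ramp(-6, 3))
)

# Every (source, variable) combination to test.
pairs <- expand.grid(source = names(df2), variable = c("Tmax", "Tmean", "Tmin"),
                     stringsAsFactors = FALSE)

cor_table <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
  src <- pairs$source[k]
  v   <- pairs$variable[k]
  # Assumption: the matching column is named "<source>_<variable>".
  test <- cor.test(df1[[v]], df2[[src]][[paste(src, v, sep = "_")]], method = "spearman")
  data.frame(variable = v, source = src,
             rho = unname(test$estimate), p.value = test$p.value)
}))
print(cor_table)
```

Because every sample series increases strictly, each Spearman coefficient here is 1; real data would of course give a spread of values.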

function returns relating columns with the corresponding column removed

This is my code so far:
record <- function(input, string){
filter(input, input$race == string |
input$flag == string)
}
Please help

You could try which. Using data from @RuiBarradas:
set.seed(1234)
recordings <- data.frame(V1 = sample(LETTERS, 10),
V2 = sample(LETTERS, 10),
V3 = letters[1:10], stringsAsFactors = FALSE)
records <- function(recordings, string){
# Row/column positions of every cell equal to `string`
hits <- which(recordings == string, arr.ind = TRUE)
# Keep the matching rows, drop the columns where the match was found
recordings <- recordings[hits[, 1], -hits[, 2], drop = FALSE]
return(recordings)
}
For A, it would return:
records(recordings, "A")
V2 V3
7 F g
For X:
records(recordings, "X")
V3
4 d
5 e
This assumes that no value is present in all columns.
If you need to only know the corresponding row values:
records <- function(recordings, string){
return(which(recordings == string, arr.ind = TRUE)[,1])
}
records(recordings, "X")
[1] 4 5

See if the following is what you want.
First I will make up a dataset, since you have not posted one.
set.seed(1234) # Make the results reproducible
recordings <- data.frame(V1 = sample(LETTERS, 10),
V2 = sample(letters, 10),
V3 = sample(4, 10, TRUE))
Now the function.
records <- function(DF, string){
inx <- DF == string
i <- apply(inx, 1, function(x) Reduce('||', x))
DF[i, which(colSums(!inx) == nrow(DF)), drop = FALSE]
}
records(recordings, "A")
# V2 V3
#7 f 3
records(recordings, "x")
# V1 V3
#5 S 1

data.table recode in selected columns

So I'm struggling with data.table. How do I make v1 and v3 numeric?
dt = data.table(v1 = c('1','2','3'), v2 = c(1,2,3), v3 = c('1','2','3'))
dt[,c(1,3), with = F] = lapply(dt[,c(1,3), with = F], as.numeric)

Try this:
dt <- data.table(v1 = c('1','2','3'), v2 = c(1,2,3), v3 = c('1','2','3'))
dt[,':='(v1=as.numeric(v1),v3=as.numeric(v3))]
sapply(dt,class)
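If the columns to convert are only known by name or position, a common data.table idiom is to combine `.SDcols` with `lapply(.SD, ...)` and assign by reference. A minimal sketch, where the vector name `cols` is just for illustration:

```r
library(data.table)

dt <- data.table(v1 = c("1", "2", "3"), v2 = c(1, 2, 3), v3 = c("1", "2", "3"))

# Columns to convert; positions could be turned into names with names(dt)[c(1, 3)].
cols <- c("v1", "v3")

# Parentheses around `cols` make := use the values stored in it as column names.
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

col_classes <- sapply(dt, class)
print(col_classes)
```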
