R fastdummies equivalent in SparkR

I have a Spark dataframe with the following data:
categories
1 John
2 Luis
3 Dora
For this I need to create a one-hot encoded version, as follows:
categories categories_Dora categories_John categories_Luis
1 John 0 1 0
2 Luis 0 0 1
3 Dora 1 0 0
This is the current code I have:
test <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John;Luis","Dora"))
df <- as.DataFrame(test)
df_2 = selectExpr(df, "split(Name, ';') AS categories","Name")
dat <- df_2 %>%
mutate(categories=explode(df_2$categories)) %>%
select("categories")
My current solution is to convert this to a regular R data.frame and apply the
fastDummies function. That works for this case, but it won't
work properly for a large dataset:
r_df = dat %>%
SparkR::collect()
dummy_r = dummy_cols(r_df)
How can I get the same result using SparkR DataFrames?
EDIT:
I cannot use sparklyr, only SparkR.

It can be done with sparklyr, which exposes many of Spark's feature transformer functions.
library(sparklyr)
test <- data.frame("categories" = c("John", "Luis","Dora"))
sc <- sparklyr::spark_connect(master = "local")
d_tbl <- copy_to(sc, test, overwrite = TRUE)
d_tbl %>%
ft_string_indexer(input_col = "categories", output_col = "cat_num") %>%
mutate(cat_num = cat_num + 1) %>%
ft_one_hot_encoder("cat_num", "cat_onehot") %>%
sdf_separate_column("cat_onehot",
paste("categories", pull(., categories), sep="_")) %>%
select(-cat_num, -cat_onehot)
The output:
# Source: spark<?> [?? x 4]
categories categories_John categories_Luis categories_Dora
<chr> <dbl> <dbl> <dbl>
1 John 0 0 0
2 Luis 0 1 0
3 Dora 0 0 1
ft_string_indexer generates a column named cat_num, which holds a numeric index for each category, much like as.numeric(factor(...)) in R. The +1 simply shifts the indexes so they run from 1 to N. ft_one_hot_encoder does the magic at the Spark level: it returns a vectorised value, like a list, containing the encoding. sdf_separate_column then expands that encoding into columns, the paste generates the column names from the category levels, and the final select drops the intermediate columns used in the transformation. Note that Spark's one-hot encoder drops the last index by default (drop_last = TRUE in ft_one_hot_encoder), which is why one category, John here, ends up encoded as all zeros.
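Since the edit rules out sparklyr, here is a minimal pure-SparkR sketch of the same idea. It assumes the distinct category levels are few enough to collect to the driver once; apart from dat (from the question), the names are illustrative:
# collect the distinct levels once (assumption: the level set is small)
levels <- collect(distinct(select(dat, "categories")))$categories
for (lv in levels) {
  # 1 if the row's category matches this level, 0 otherwise
  dat <- withColumn(dat, paste0("categories_", lv),
                    cast(dat$categories == lv, "integer"))
}
head(dat)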

Related

How can I add more columns to a dataframe with a for loop

I am a beginner in R. I need to translate some EViews code to R. In EViews there is loop code that adds 10 or more columns/variables to the data using some function.
Here is example EViews code to estimate deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate function, but it is very tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df <- df %>% mutate(deflator_gdp = nominal_gdp/real_gdp*100,
deflator_inv = nominal_inv/real_inv*100,
deflator_gov = nominal_gov/real_gov*100,
deflator_exp = nominal_exp/real_exp*100)
print(df)
Please help me do this in R with a loop.
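For a literal translation of the EViews loop, a base-R sketch along these lines should work (assuming the nominal_/real_ column naming above):
for (x in c("gdp", "inv", "gov", "exp")) {
  # build both column names from the suffix, as in the EViews frml
  df[[paste0("deflator_", x)]] <-
    df[[paste0("nominal_", x)]] / df[[paste0("real_", x)]] * 100
}
That said, reshaping the data first, as the next answer shows, is usually cleaner.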
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to that nice, dplyr-friendly data.frame?
You need to gather your data using tidyr::gather. However, because you have two sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps; there may be a better way, though (a one-step alternative is sketched after the merge below).
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
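As a one-step alternative, newer tidyr (>= 1.0) can perform both gathers at once with pivot_longer; a sketch, assuming the same df with its obs_id column:
library(tidyr)
# ".value" sends the part before "_" (real/nominal) to output columns,
# while the part after "_" becomes the `variable` column
df2 <- df %>%
  pivot_longer(-obs_id,
               names_to = c(".value", "variable"),
               names_sep = "_")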
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
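To keep the results alongside the originals with informative names, a small extension of the same idea (a sketch, using the df defined above):
x <- colnames(df)
# element-wise division keeps the numerator's column names,
# so rename them before binding the result back on
deflators <- df[ sort(x[ grepl("^nominal", x) ]) ] /
  df[ sort(x[ grepl("^real", x) ]) ] * 100
names(deflators) <- sub("^nominal", "deflator", names(deflators))
df <- cbind(df, deflators)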
We can loop over the column names using purrr::map_dfc and apply a custom function over the selected columns (i.e. the columns that match the current name from nms):
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list, not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: learn more about quo_name, !!, and ensym, which are tools for programming with dplyr.

How can I group by one variable in terms of status of a different variable in a longitudinal situation in R?

I'm new to R, so please go easy on me... I have some longitudinal data (a sample is reconstructed at the end of the answer below).
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing entry. The desired end result is a table of these counts by location; this is what I have tried so far:
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, because there seems to be some inconsistency between your request and the desired output. However, let's try: it seems you need a kind of frequency table, which you can manage with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, the Complete, and the other, so here a new column about it:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
`Number.complete` = result$Complete,
`Number.incomplete.missing` = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or, if you prefer a dplyr chain:
data %>%
mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
do( as.data.frame.matrix(table(.$Location,.$case))) %>%
mutate(Location = rownames(.)) %>%
select(3,1,2) %>%
`colnames<-`(c("Location", "Number of complete", "Number of incomplete or missing"))
Location Number of complete Number of incomplete or missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here is your data (next time, try to include it in a usable way in the question)
data <- data.frame( ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
'Complete','Incomplete','Incomplete','Missing'))
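A more compact dplyr/tidyr sketch of the same table (assuming tidyr >= 1.0), which avoids the rownames round-trip:
library(dplyr)
library(tidyr)
data %>%
  mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
  count(Location, case) %>%
  pivot_wider(names_from = case, values_from = n, values_fill = 0)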

Extract time series from netCDF in R

I created a netCDF file using the TRMM_3B42_Daily product over 1998-01-01 to 1998-12-31. This is the script I used in R:
library(ncdf4)
library(abind)
lon = seq(-91.875, -86.875, by = 0.25)
lat = seq(13.875, 16.875, by = 0.25)
x_dim <- ncdim_def( "lon", "degrees_east", lon, create_dimvar=TRUE)
y_dim <- ncdim_def( "lat", "degrees_north", lat, create_dimvar=TRUE)
t_dim <- ncdim_def( "time", "days since 1997-12-31 12:00:00.0 -0:00", 1:365, unlim=FALSE)
mv=9999.900390625
precipitation_var <- ncvar_def("precipitation", "mm", list(y_dim,x_dim,t_dim), mv)
nrow = 13
ncol = 21
NA.matrix=matrix(rep(NA,nrow*ncol))
precip=array(NA.matrix,c(nrow,ncol, 1))
# test01 is assumed to hold the vector of daily TRMM file names
for (i in 1:length(test01)) {
  precip_nc = nc_open(test01[i])
  precip_get_nc = ncvar_get(precip_nc, "precipitation")
  precip = abind(precip, precip_get_nc)
}
precip=precip[,,-1]
PRECIPITATION_nc = nc_create("PRECIPITATION_1998.nc", precipitation_var)
precipitation_nc_put=ncvar_put (PRECIPITATION_nc, precipitation_var, precip)
nc_close(PRECIPITATION_nc)
Following an example I found online, I tried extracting the values in order to plot a time series, but it seems I am averaging the values of two cells instead of extracting the values of a single cell. How do I fix this? Is there a way to create a loop so that it extracts the values of different cells? (In this case that would be 13 x 21 = 273 cells.)
library(raster)
b <- brick('PRECIPITATION_1998.nc')
be <- crop(b, extent(13.875, 14.125, -91.875,-91.625))
a <- aggregate(be, dim(be)[2:1], na.rm=TRUE)
v <- values(a)
write.csv(v, 'precip.csv', row.names=FALSE)
Also, two other problems I found were that the dates in the exported file have an X in front, and that the values are shown horizontally instead of vertically. Any help would be greatly appreciated! Thanks.
Extraction of point data can be easily accomplished by creating a SpatialPoints object containing the points from which you want to extract data, followed by an extract operation.
Concerning the other topics: the "X"s are added because column names cannot start with numerals, so a character is prepended. The horizontal ordering can easily be changed after extraction with some transposing.
This, for example, should work (it also solves the "X" problem and changes the format to "column-like"):
library(raster)
library(stringr)
library(lubridate)
library(tidyverse)
b <- brick('/home/lb/Temp/buttami/PRECIPITATION_1998.nc')
lon = c(-91.875,-91.625) # Array of x coordinates
lat <- c(13.875, 14.125) # Array of y coordinates
points <- SpatialPoints(cbind(lat, lon)) # Build a SpatialPoints object
# Extract and tidy
points_data <- b %>%
raster::extract(points, df = T) %>%
gather(date, value, -ID) %>%
spread(ID, value) %>% # Can be skipped if you want a "long" table
mutate(date = ymd(str_sub(names(b),2))) %>%
as_tibble()
points_data
# A tibble: 365 × 3
date `1` `2`
<date> <dbl> <dbl>
1 1998-01-01 0 0
2 1998-01-02 0 0
3 1998-01-03 0 0
4 1998-01-04 0 0
5 1998-01-05 0 0
6 1998-01-06 0 0
7 1998-01-07 0 0
8 1998-01-08 0 0
9 1998-01-09 0 0
10 1998-01-10 0 0
# ... with 355 more rows
plot(points_data$date,points_data$`1`)
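To get all 13 x 21 = 273 cells rather than selected points, no loop is needed; a hedged sketch using the same brick (one column per cell, one row per date):
# values(b) is a cells x layers matrix; transpose so dates run down the rows
all_cells <- as.data.frame(t(values(b)))
all_cells$date <- ymd(str_sub(names(b), 2)) # strip the leading "X"
write.csv(all_cells, 'precip_all_cells.csv', row.names = FALSE)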

Vectorization in R

I am new to R and have researched vectorization, but I am still training my mind to think in vectorized terms. Very often, examples of vectorization instead of loops are either too simple, making it difficult for me to generalize from them, or absent altogether.
Can anyone suggest how I can vectorize the following?
Model2 <- subset(Cor.RMA, MODEL == Models.Sort[2,1])
RCM2 <- count(Model2$REPAIR_CODE)
colnames(RCM2) <- c("REPAIR_CODE", "FREQ")
M2M <- merge(RCM.Sort, RCM2, by = "REPAIR_CODE", all.x = TRUE)
M2M.Sort <- M2M[order(M2M$FREQ.x, decreasing = TRUE), ]
M2M.Sort[is.na(M2M.Sort)] <- 0
In the above code, each "2" needs to run from 2 to 85.
writeWorksheetToFile(file="CL2 - TC - RC.xlsx",
data = M2M.Sort[ ,c("FREQ.y")],
sheet = "RC by Model",
clearSheets = FALSE,
startRow = 6,
startCol = 6)
In the above code, "data" should from from "M2M..." to "M85M..." and "startCol" should run from 6 to 89 for an Excel printout.
The data frame this comes from (Cor.RMA) has columns "MODEL", "REPAIR_CODE", and others that are unused.
RCM.Sort is a frequency table of each "REPAIR_CODE" across all models that I use as a Master list to adjoin Device-specific Repair Code counts. (left-join: all.x = TRUE)
Models.Sort is a frequency table I generated using the "count" function from the plyr package, so I can create subsets for each MODEL.
Then I merge a list of each "REPAIR_CODE" that I generated using the "unique" function.
Sample Data:
CASE_NO DEVICE_TYPE MODEL TRIAGE_CODE REPAIR_CODE
12341 Smartphone X TC01 RC01
12342 Smartphone Y TC02 RC02
12343 Smartphone Z TC01, TC05 RC05
12344 Tablet AA TC02 RC37
12345 Wearable BB TC05 RC37
12346 Smartphone X TC07 RC01
12347 Smartphone Y TC04 RC02
I very much appreciate your time and effort if you are willing to help.
Alright, this is not what your original script did, but here goes:
models <- c("X","Y","Z","AA","BB") # in your case it would be Models.Sort[2:85,1]
new <- Cor.RMA[Cor.RMA$MODEL %in% models,]
new2 <- aggregate(new$REPAIR_CODE, list(new$MODEL), table)
temp <- unlist(new2[[2]])
temp <- temp[, order(colSums(temp), decreasing = T)]
out <- data.frame(group=new2[,1], temp)
out <- out[order(rowSums(out[,-1]), decreasing = T),]
out
# group RC01 RC02 RC37 RC05
# 3 X 2 0 0 0
# 4 Y 0 2 0 0
# 1 AA 0 0 1 0
# 2 BB 0 0 1 0
# 5 Z 0 0 0 1
You can then write it easily to an xlsx file, e.g. with:
library(xlsx)
write.xlsx(out, "test.xlsx", row.names = FALSE)
Edit: Added sorting.
Edit2: Fixed sorting.
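If you want to stay closer to the original per-model workflow, an lapply sketch like this should also work (assuming Cor.RMA, Models.Sort, RCM.Sort, and the XLConnect writeWorksheetToFile call from the question; base table stands in for plyr::count):
results <- lapply(2:85, function(i) {
  model <- subset(Cor.RMA, MODEL == Models.Sort[i, 1])
  rc <- as.data.frame(table(model$REPAIR_CODE)) # like plyr::count
  colnames(rc) <- c("REPAIR_CODE", "FREQ")
  m <- merge(RCM.Sort, rc, by = "REPAIR_CODE", all.x = TRUE)
  m <- m[order(m$FREQ.x, decreasing = TRUE), ]
  m[is.na(m)] <- 0
  m[, "FREQ.y", drop = FALSE] # keep as a one-column data.frame
})
# one write per model, startCol running 6:89 as in the question
for (i in seq_along(results)) {
  writeWorksheetToFile(file = "CL2 - TC - RC.xlsx",
                       data = results[[i]],
                       sheet = "RC by Model",
                       clearSheets = FALSE,
                       startRow = 6,
                       startCol = 5 + i)
}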

Count based on multiple conditions from other data.frame

I am migrating analysis from Excel to R, and would like some input on how best to perform something similar to Excel's COUNTIFS in R.
I have two data.frames, statedf and memberdf.
statedf=data.frame(state=c('MD','MD','MD','NY','NY','NY'), week = 5:7)
memberdf=data.frame(memID = 1:15, state = c('MD','MD','NY','NY','MD'),
finalweek = c(3,3,5,3,3,5,3,5,3,5,6,5,2,3,5),
orders = c(1,2,3))
This data is for a subscription-based business. I would like to know the number of members who newly lapsed for each week/state combo in statedf, where newly lapsed is defined by statedf$week - 1 = memberdf$finalweek. Further, I would like separate counts for each orders value (1, 2, 3).
The desired output would look like
out <- data.frame(state=c('MD','MD','MD','NY','NY','NY'), week = 5:7,
oneorder = c(0,1,0,0,0,0),
twoorder = c(0,0,1,0,1,0),
threeorder = c(0,3,0,0,1,0))
I asked (and got a great response to) a simpler version of this question yesterday; the answers revolved around creating a new data.frame based on memberdf. However, I need to append the data to statedf, because statedf has state/week combos that don't exist in memberdf, and vice versa. If this were in Excel, I'd use COUNTIFS, but I am struggling to find a solution in R.
Thanks.
Here is a solution with the dplyr and tidyr packages:
library(tidyr) ; library(dplyr)
counts <- memberdf %>%
mutate(lapsedweek = finalweek + 1) %>%
group_by(state, lapsedweek, orders) %>%
tally()
counts <- counts %>% spread(orders, n, fill = 0)
out <- left_join(statedf, counts, by = c("state", "week" = "lapsedweek"))
out[is.na(out)] <- 0 # replace the NAs (combos with no lapses) with 0s
names(out)[3:5] <- paste0("order", names(out)[3:5]) # rename columns
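With tidyr >= 1.0, the spread() step can equivalently be written with pivot_wider, which also handles the zero fill (a sketch):
counts <- counts %>%
  pivot_wider(names_from = orders, values_from = n, values_fill = 0)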
We could create a new variable ('week1') in the 'statedf' dataset, merge 'memberdf' with 'statedf', and then reshape from 'long' to 'wide' format with dcast. I changed the 'orders' column to match the column names in 'out'.
statedf$week1 <- statedf$week-1
df1 <- merge(memberdf[-1], statedf, by.x=c('state', 'finalweek'),
by.y=c('state', 'week1'), all.y=TRUE)
lvls <- paste0(c('one', 'two', 'three'), 'order')
df1$orders <- factor(lvls[df1$orders],levels=lvls)
library(reshape2)
out1 <- dcast(df1, state+week~orders, value.var='orders', length)[-6]
out1
# state week oneorder twoorder threeorder
#1 MD 5 0 0 0
#2 MD 6 1 0 3
#3 MD 7 0 1 0
#4 NY 5 0 0 0
#5 NY 6 0 1 1
#6 NY 7 0 0 0
all.equal(out, out1)
#[1] TRUE
