SQL/SAS equivalent code in Microsoft R

I want to convert one of my PROC SQL/SAS programs to Revolution R/Microsoft R. Here is my sample code:
proc sql;
  create table GENDER_YEAR as
  select YEAR, GENDER,
         count(distinct CARD_NO) as CM_COUNT,
         sum(SPEND) as TOTAL_SPEND,
         sum(case when SPEND GT 0 then 1 else 0 end) as NO_OF_TRANS
  from ABC
  group by YEAR, GENDER;
quit;
I'm trying the code below in Rev R:
library("RevoPemaR")
byGroupPemaObj <- PemaByGroup()
GENDER_cv_grouped <- pemaCompute(pemaObj = byGroupPemaObj,
                                 data = Merchant_Trans,
                                 groupByVar = "GENDER",
                                 computeVars = c("LOCAL_SPEND"),
                                 fnList = list(sum = list(FUN = sum, x = NULL)))
This calculates only one thing at a time, but I need the distinct count of CARD_NO, the sum of SPEND, and the number of non-zero SPEND rows (as NO_OF_TRANS) for each combination of YEAR and GENDER.
The output should look like this:
YEAR  GENDER CM_COUNT TOTAL_SPEND NO_OF_TRANS
YEAR1 M            23         120         119
YEAR1 F            21         110         110
YEAR2 M            20         121         121
YEAR2 F            35         111         109
Looking forward to any help on this.

The easiest way to go about this is to concatenate the columns into a single column and use that. It seems that both dplyrXdf and RevoPemaR do not support grouping by two variables yet.
The way to do this would be to add an rxDataStep on top which creates this variable first, and then group by it. Some approximate code for this is:
library("RevoPemaR")
byGroupPemaObj <- PemaByGroup()
rxDataStep(inData = Merchant_Trans, outFile = Merchant_Trans_Groups,
           transforms = list(year_gender = paste(YEAR, GENDER, sep = "_")))
GENDER_cv_grouped <- pemaCompute(pemaObj = byGroupPemaObj,
                                 data = Merchant_Trans_Groups,
                                 groupByVar = "year_gender",
                                 computeVars = c("LOCAL_SPEND"),
                                 fnList = list(sum = list(FUN = sum, x = NULL)))
Note that overall there are three methods of doing a groupBy in Rev R as far as I know. Each has its pros and cons.
rxSplit - This actually creates a different XDF file for each group that you want. It can be used with the splitByFactor arg, where the factor specifies which groups should be created.
RevoPemaR's PemaByGroup - This assumes that each group's data can be stored in RAM, which is a fair assumption. It also needs the original Xdf file to be sorted by the groupBy column, and it only supports grouping by one column.
dplyrXdf's group_by - This is a spin on the popular dplyr package. It has many variable manipulation methods (offering a different way to write rxSplit and rxDataStep) using dplyr-like syntax. It also only supports one column to group with.
All three methods currently support only single-variable group operations, hence they all require some pre-processing of the data to work; a sketch of the rxSplit route is below.
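For concreteness, here is a minimal sketch of the rxSplit route, assuming the combined year_gender column created above exists and is stored as a factor (splitByFactor requires a factor column; the file and variable names are carried over from the question):
library(RevoScaleR)
# write one Xdf file per year/gender group; returns a list of Xdf data sources
groupXdfs <- rxSplit(inData = Merchant_Trans_Groups,
                     outFilesBase = "merchant_by_group",
                     splitByFactor = "year_gender")
# each element can then be summarised on its own, e.g.:
rxSummary(~ LOCAL_SPEND, data = groupXdfs[[1]])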

Here's a simple solution using dplyrXdf. Unlike with data frames, the n_distinct() summary function provided by dplyr doesn't work with xdf files, so this does a two-step summarisation: first including card_no as a grouping variable, and then counting the number of card_no's per year/gender group.
First, generate some example data:
library(dplyrXdf) # also loads dplyr
set.seed(12345)
df <- expand.grid(year=2000:2005, gender=c("F", "M")) %>%
    group_by(year, gender) %>%
    do(data.frame(card_no=sample(20, size=10, replace=TRUE),
                  spend=rbinom(10, 1, 0.5) * runif(10) * 100))
xdf <- rxDataStep(df, "ndistinct.xdf", overwrite=TRUE)
Now call summarise twice, taking advantage of the fact that the first summarise will remove card_no from the list of grouping variables:
smry <- xdf %>%
    mutate(trans=spend > 0) %>%
    group_by(year, gender, card_no) %>%
    summarise(n=n(), total_spend=sum(spend), no_of_trans=sum(trans)) %>%
    summarise(cm_count=n(), total_spend=sum(total_spend), no_of_trans=sum(no_of_trans))
as.data.frame(smry)
#    year gender cm_count total_spend no_of_trans
# 1  2000      F       10   359.30313           6
# 2  2001      F        8   225.89571           3
# 3  2002      F        7   332.58365           6
# 4  2003      F        5   333.72169           5
# 5  2004      F        7   280.90448           5
# 6  2005      F        9   254.37680           5
# 7  2000      M        8   309.77727           6
# 8  2001      M        8   143.70835           2
# 9  2002      M        8   269.64968           5
# 10 2003      M        8   265.27049           4
# 11 2004      M        9    99.73945           3
# 12 2005      M        8   178.12686           6
Verify that this is the same result (modulo row ordering) as you'd get by running a dplyr chain on the original data frame:
df %>%
    group_by(year, gender) %>%
    summarise(cm_count=n_distinct(card_no), total_spend=sum(spend), no_of_trans=sum(spend > 0)) %>%
    arrange(gender, year)
#     year gender cm_count total_spend no_of_trans
#    <int> <fctr>    <int>       <dbl>       <int>
# 1   2000      F       10   359.30313           6
# 2   2001      F        8   225.89571           3
# 3   2002      F        7   332.58365           6
# 4   2003      F        5   333.72169           5
# 5   2004      F        7   280.90448           5
# 6   2005      F        9   254.37680           5
# 7   2000      M        8   309.77727           6
# 8   2001      M        8   143.70835           2
# 9   2002      M        8   269.64968           5
# 10  2003      M        8   265.27049           4
# 11  2004      M        9    99.73945           3
# 12  2005      M        8   178.12686           6

Related

How to create conditional group tags with nested data in R?

My data has 5 different levels of nesting (the full structure() dump is at the end of the question):
Categories (e.g., "Countries")
Countries (e.g., "USA")
Cities (e.g., "New York")
Counties (e.g., "Manhattan")
Places (e.g., "Times Square")
Each row in my df (except for the Level 1 entry) is linked to a parent (a level above). For example: Times Square -> Manhattan -> New York -> USA -> Countries
For each Name, there is a corresponding n_values column, indicating the number of data entries.
My goal: I want to form groups with >=8 data entries. For groups with n_values <8, I want to merge them with the Parent column a level above. This new allocation should be expressed in a new variable new_group.
It is important to start at the lower levels first! For example, there are only 2 data entries for "Times Square", so we want to merge those entries with the parent "Manhattan". Manhattan now has 3+2=5 data entries. This is still <8, so we merge those 5 entries with the next parent "New York", which now has 16+5=21 entries, so we're good.
I have tried to write a loop like this:
for (i in 5:1) {
  df %>% filter(Level == i) %>% group_by(ID) %>% summarize(n = n())
}
However, I fail to merge that information with the original data to create the dataset I want. Can anyone help?
The data:
structure(list(ID = c(19, 12, 3, 41, 50, 6, 77, 83, 9, 105, 11),
               Parent = c(NA, 19, 12, 3, 41, 12, 19, 77, 77, 19, 105),
               Level = c(1, 2, 3, 4, 5, 3, 2, 3, 3, 2, 3),
               Name = c("Countries", "USA", "New York", "Manhattan", "Times Square",
                        "Boston", "UK", "London", "Oxford", "Canada", "Vancouver"),
               n_values = c(NA, 17, 16, 3, 2, 13, 12, 7, 8, 9, 8)),
          class = "data.frame",
          row.names = c(NA, -11L))
Let's assume that your data is stored in a data frame called df. The most straightforward approach would be to first sort the rows of the table by "Level" in descending order and set "new_group" to the values of "Name". We'll also track the per-group totals in a column called "new_values". Then iterate through the rows until a row with new_values < 8 is encountered, at which point that row's "new_group" is changed to that of its parent, and its "Parent" is also updated to match its parent's "Parent". At that point, the row loop restarts. The outer loop terminates when no "new_group"s have new_values < 8:
library(tidyverse)
df_sorted <- df %>%
  arrange(desc(Level)) %>%
  mutate(new_group = Name) %>%
  group_by(new_group) %>%
  mutate(new_values = sum(n_values)) %>%
  ungroup()
while (any(df_sorted$new_values < 8, na.rm = TRUE)) {
  for (i in 1:nrow(df_sorted)) {
    if (df_sorted$new_values[i] < 8) {
      to_id <- df_sorted$Parent[i]
      to_row <- which(df_sorted$ID == to_id)
      df_sorted$new_group[i] <- df_sorted$Name[to_row]
      df_sorted$Parent[i] <- df_sorted$Parent[to_row]
      df_sorted <- df_sorted %>%
        group_by(new_group) %>%
        mutate(new_values = sum(n_values)) %>%
        ungroup()
      break # terminate the for loop immediately and return to the outer while loop
    }
  }
}
      ID Parent Level Name         n_values new_group new_values
   <dbl>  <dbl> <dbl> <chr>           <dbl> <chr>          <dbl>
 1    50     12     5 Times Square        2 New York          21
 2    41     12     4 Manhattan           3 New York          21
 3     3     12     3 New York           16 New York          21
 4     6     12     3 Boston             13 Boston            13
 5    83     19     3 London              7 UK                19
 6     9     77     3 Oxford              8 Oxford             8
 7    11    105     3 Vancouver           8 Vancouver          8
 8    12     19     2 USA                17 USA               17
 9    77     19     2 UK                 12 UK                19
10   105     19     2 Canada              9 Canada             9
11    19     NA     1 Countries          NA Countries         NA
Edit: The version below adds a "touched" column to track rows that have been modified in the loop, and also adds some checks for NA values. For the data set used above, it produces an identical result to the previous version. It also appears to work correctly on the data set below.
df <- structure(list(ID = c(19, 12, 3, 41, 50, 6, 77, 83, 9, 105, 11),
                     Parent = c(NA, 19, 12, 3, 41, 12, 19, 77, 77, 19, 105),
                     Level = c(1, 2, 3, 4, 5, 3, 2, 3, 3, 2, 3),
                     Name = c("Countries", "USA", "New York", "Manhattan", "Times Square",
                              "Boston", "UK", "London", "Oxford", "Canada", "Vancouver"),
                     n_values = c(NA, 0, 0, 3, 2, 0, 12, 7, 8, 9, 8)),
                class = "data.frame", row.names = c(NA, -11L))
df_sorted <- df %>%
  arrange(desc(Level)) %>%
  mutate(new_group = Name) %>%
  group_by(new_group) %>%
  mutate(
    new_values = sum(n_values),
    touched = is.na(n_values) | n_values >= 8
  ) %>%
  ungroup()
while (any(!df_sorted$touched)) {
  for (i in 1:nrow(df_sorted)) {
    if (df_sorted$new_values[i] < 8 & !is.na(df_sorted$Parent[i]) & any(!df_sorted$touched)) {
      to_id <- df_sorted$Parent[i]
      to_row <- which(df_sorted$ID == to_id)
      df_sorted$new_group[i] <- df_sorted$Name[to_row]
      df_sorted$Parent[i] <- df_sorted$Parent[to_row]
      df_sorted$touched[i] <- TRUE
      df_sorted <- df_sorted %>%
        group_by(new_group) %>%
        mutate(new_values = sum(n_values, na.rm = TRUE)) %>%
        ungroup()
      break # terminate the for loop immediately and return to the outer while loop
    }
  }
}
      ID Parent Level Name         n_values new_group new_values touched
   <dbl>  <dbl> <dbl> <chr>           <dbl> <chr>          <dbl> <lgl>
 1    50     NA     5 Times Square        2 Countries          5 TRUE
 2    41     NA     4 Manhattan           3 Countries          5 TRUE
 3     3     NA     3 New York            0 Countries          5 TRUE
 4     6     NA     3 Boston              0 Countries          5 TRUE
 5    83     19     3 London              7 UK                19 TRUE
 6     9     77     3 Oxford              8 Oxford             8 TRUE
 7    11    105     3 Vancouver           8 Vancouver          8 TRUE
 8    12     NA     2 USA                 0 Countries          5 TRUE
 9    77     19     2 UK                 12 UK                19 TRUE
10   105     19     2 Canada              9 Canada             9 TRUE
11    19     NA     1 Countries          NA Countries          5 TRUE

Creating a Function in dplyr that Operates on Columns through Variable/String Manipulation

I am working with a dataset that contains many similarly named columns (e.g. thing_1, thing_2, blargh_1, blargh_2, fizz_1, fizz_2), and I've been trying to write a function that takes in a string (such as fizz) and performs some operation on all columns whose names extend that string (such as fizz_1 + fizz_2).
So far, I have structured my code to look something like this:
newData <- data %>%
  mutate(fizz = f("fizz"))
f <- function(name) {
  name_1 + name_2
}
where f as written obviously doesn't work. I've toyed around with assign, but haven't been terribly successful. I'm also open to other ways to tackle the problem (maybe a function that takes in a dataset and the string). Thanks!
If we are creating a function, then make use of the select helpers, which can take starts_with, ends_with, or matches as arguments:
library(dplyr)
library(purrr)
f1 <- function(data, name){
  data %>%
    mutate(!! name := select(., starts_with(name)) %>% reduce(`+`))
}
f1(df1, "fizz")
f1(df1, "blargh")
f1(df1, "thing")
#  thing_1 thing_2 thing_3 fizz_1 fizz_2 blargh_1 blargh_2 thing
#1       1       6      11      2      3        4        5    18
#2       2       7      12      3      4        5        6    21
#3       3       8      13      4      5        6        7    24
#4       4       9      14      5      6        7        8    27
#5       5      10      15      6      7        8        9    30
Or specify select(., matches(str_c("^", name, "_\\d+$"))), where str_c() comes from stringr; a fuller sketch of this variant follows the data below.
data
df1 <- data.frame(thing_1 = 1:5, thing_2 = 6:10, thing_3 = 11:15,
                  fizz_1 = 2:6, fizz_2 = 3:7, blargh_1 = 4:8, blargh_2 = 5:9)
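To make the matches() variant concrete, here is a minimal sketch under the same setup (the function name f2 is just for illustration, and paste0() could replace str_c()):
library(stringr)
f2 <- function(data, name){
  data %>%
    # sum every column matching e.g. "fizz_<digits>" into a new column
    mutate(!! name := select(., matches(str_c("^", name, "_\\d+$"))) %>% reduce(`+`))
}
f2(df1, "fizz")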

How to generate a dummy treatment variable based on values from two different variables

I would like to generate a dummy treatment variable "treatment" based on the country variable "iso" and the earthquake dummy variable "quake" (for dataset "data").
Basically, if quake==1 at least once for a given "iso" in my entire timeframe (let's say 2000-2018), I would like all observations for that "iso" to have treatment==1; for all other countries, treatment==0. So countries that are affected by earthquakes have 1 for all observations, others 0.
I have tried using dplyr, but since I'm still very green at R it has taken me multiple tries and I haven't found a solution yet. I've looked on this website and on Google.
I suspect the solution should be something along the lines of the following, but I can't finish it myself:
data %>%
  filter(quake == 1) %>%
  group_by(iso) %>%
  mutate(treatment)
Welcome to Stack Overflow! Here is a dplyr solution (following what you started):
## data
set.seed(123)
data <- data.frame(year = rep(2000:2002, each = 26),
                   iso = rep(LETTERS, times = 3),
                   quake = sample(0:1, 26*3, replace = TRUE))
## solution (dplyr option)
library(dplyr)
data2 <- data %>%
  arrange(iso) %>%
  group_by(iso) %>%
  mutate(treatment = if_else(sum(quake) == 0, 0, 1))
data2
# A tibble: 78 x 4
# Groups:   iso [26]
#     year iso   quake treatment
#    <int> <fct> <int>     <dbl>
#  1  2000 A         0         1
#  2  2001 A         1         1
#  3  2002 A         1         1
#  4  2000 B         1         1
#  5  2001 B         1         1
#  6  2002 B         0         1
#  7  2000 C         0         1
#  8  2001 C         0         1
#  9  2002 C         1         1
# 10  2000 D         1         1
# ... with 68 more rows
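A terser equivalent, if you prefer it (a sketch on the same data): any() checks whether the country had at least one quake, and the unary + coerces the logical to 0/1.
data2 <- data %>%
  group_by(iso) %>%
  mutate(treatment = +any(quake == 1)) %>%  # 1 if any quake for this iso, else 0
  ungroup()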

A clean way for adding variable-length values to data frame by group

I am creating random data. It should contain the variables id and val, where values cannot overlap within a single id but can overlap across id-s. Different id-s have different numbers of values n. I can create the desired result manually as:
n <- c(3, 2, 4)
data.frame(id = rep(letters[1:3], n),
           val = c(sample(10, n[1]),
                   sample(10, n[2]),
                   sample(10, n[3])))
  id val
1  a   5
2  a  10
3  a   4
4  b   9
5  b  10
6  c  10
7  c   5
8  c   2
9  c   9
I can also imagine different solutions involving looping over groups and using rbind, or rep-ing the id-s the corresponding number of times. But all such approaches feel dirty, and may not scale to many variables and large data.
Are there any cleaner ways to achieve it? Something like (in dplyrish):
data.frame(id = letters[1:3]) %>%
  mutate(i = row_number()) %>%
  group_by(id) %>%
  summarize_into_df(id = id, val = sample(10, n[i]))
You can loop through n with lapply, create a list column using sample, then unnest it:
library(dplyr)
library(tidyr)
n <- c(3,2,4)
data.frame(id = letters[1:length(n)]) %>%
  mutate(val = lapply(n, sample, x = 10)) %>%
  unnest()
#  id val
#1  a   9
#2  a   4
#3  a  10
#4  b   4
#5  b   8
#6  c   5
#7  c  10
#8  c   8
#9  c   2
Or without using any package, which is very close to what you have; just replace the manual construction with unlist(lapply(...)):
data.frame(id = rep(letters[1:length(n)], n),
           val = unlist(lapply(n, sample, x = 10)))

R: Consolidating duplicate observations?

I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations only appear once; others appear multiple times (upwards of 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID, and all information in columns 2:150 is concatenated. I do not need any calculations run on these observations, just quick munging.
I've tried:
df.new <- group_by(df,"ID")
and also:
library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]
and unfortunately neither has worked. Any help is appreciated!
Using base R:
df = data.frame(ID = c("a", "a", "b", "b", "b", "c", "d", "d"),
                day = c("1", "2", "3", "4", "5", "6", "7", "8"),
                year = c(2016, 2017, 2017, 2016, 2017, 2016, 2017, 2016),
                stringsAsFactors = FALSE)
> df
  ID day year
1  a   1 2016
2  a   2 2017
3  b   3 2017
4  b   4 2016
5  b   5 2017
6  c   6 2016
7  d   7 2017
8  d   8 2016
Do:
z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x){ paste0(x, collapse = "/") })
Result:
> z
  id   day           year
1  a   1/2      2016/2017
2  b 3/4/5 2017/2016/2017
3  c     6           2016
4  d   7/8      2017/2016
EDIT
If you want to avoid "collapsing" NAs, do:
z = aggregate(df[, 2:3],
              by = list(id = df$ID),
              function(x){ paste0(x[!is.na(x)], collapse = "/") })
For a data frame like:
> df
  ID  day year
1  a    1 2016
2  a    2   NA
3  b    3 2017
4  b    4 2016
5  b <NA> 2017
6  c    6 2016
7  d    7 2017
8  d    8 2016
The result is:
> z
  id  day           year
1  a  1/2           2016
2  b  3/4 2017/2016/2017
3  c    6           2016
4  d  7/8      2017/2016
I have had a similar problem in the past, though I wasn't dealing with several copies of the same data: it was in many cases just 2 instances, and in some cases 3. Below was my approach. Hopefully it will help.
library(dplyr)
library(tidyr)
# index of all duplicated entries; the fromLast pass also flags the first occurrence
idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ]       # the duplicated rows
non_dupes <- df[!idx, ]  # the non-duplicated rows
# roll up the duplicated ones
temp <- dupes %>%
  group_by(key) %>%
  fill_(colnames(dupes), .direction = "down") %>%
  fill_(colnames(dupes), .direction = "up") %>%
  slice(1)
Then it is easy to merge temp and non_dupes back together.
EDIT
I would highly recommend filtering df down to only the population that is relevant for your end goal before doing this, as the process can take some time. A modernised version of the fill step is sketched below.
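As a side note, fill_() has since been deprecated in tidyr; a sketch of the same roll-up using the current fill() interface (untested here, since df and key come from the question) would be:
temp <- dupes %>%
  group_by(key) %>%
  fill(everything(), .direction = "downup") %>%  # fill NAs down then up within each group
  slice(1)                                       # keep one consolidated row per key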
What about?
df %>%
  group_by(ID) %>%
  summarise_each(funs(paste0(., collapse = "/")))
Or, reproducibly:
iris %>%
  group_by(Species) %>%
  summarise_each(funs(paste0(., collapse = "/")))
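Note that summarise_each() and funs() are superseded in current dplyr; a sketch of the equivalent with across() would be:
iris %>%
  group_by(Species) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))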
