Count based on multiple conditions from other data.frame - r

I am migrating analysis from Excel to R, and would like some input on how best to perform something similar to Excel's COUNTIFS in R.
I have two data.frames, statedf and memberdf.
statedf=data.frame(state=c('MD','MD','MD','NY','NY','NY'), week = 5:7)
memberdf=data.frame(memID = 1:15, state = c('MD','MD','NY','NY','MD'),
finalweek = c(3,3,5,3,3,5,3,5,3,5,6,5,2,3,5),
orders = c(1,2,3))
This data is for a subscription-based business. I would like to know the number of members who newly lapsed for each week/state combo in statedf, where newly lapsed is defined by statedf$week - 1 == memberdf$finalweek. Further, I would like to have separate counts for each order value (1, 2, 3).
The desired output would look like
out <- data.frame(state=c('MD','MD','MD','NY','NY','NY'), week = 5:7,
oneorder = c(0,1,0,0,0,0),
twoorder = c(0,0,1,0,1,0),
threeorder = c(0,3,0,0,1,0))
I asked (and got a great response for) a simpler version of this question yesterday; the answers revolved around creating a new data.frame based on memberdf. However, I need to append the data to statedf, because statedf has state/week combos that don't exist in memberdf, and vice versa. If this were in Excel, I'd use COUNTIFS, but I am struggling to find a solution in R.
Thanks.

Here is a solution with the dplyr and tidyr packages:
library(tidyr) ; library(dplyr)
counts <- memberdf %>%
mutate(lapsedweek = finalweek + 1) %>%
group_by(state, lapsedweek, orders) %>%
tally()
counts <- counts %>% spread(orders, n, fill = 0)
out <- left_join(statedf, counts, by = c("state", "week" = "lapsedweek"))
out[is.na(out)] <- 0 # convert rows with all NAs to 0s
names(out)[3:5] <- paste0("order", names(out)[3:5]) # rename columns

We could create a new variable ('week1') in the 'statedf' dataset, merge the 'memberdf' with 'statedf', and then reshape from 'long' to 'wide' format with dcast. I changed the 'orders' column to match the column names in the 'out'.
statedf$week1 <- statedf$week-1
df1 <- merge(memberdf[-1], statedf, by.x=c('state', 'finalweek'),
by.y=c('state', 'week1'), all.y=TRUE)
lvls <- paste0(c('one', 'two', 'three'), 'order')
df1$orders <- factor(lvls[df1$orders],levels=lvls)
library(reshape2)
out1 <- dcast(df1, state+week~orders, value.var='orders', length)[-6]
out1
# state week oneorder twoorder threeorder
#1 MD 5 0 0 0
#2 MD 6 1 0 3
#3 MD 7 0 1 0
#4 NY 5 0 0 0
#5 NY 6 0 1 1
#6 NY 7 0 0 0
all.equal(out, out1)
#[1] TRUE
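For readers coming from Excel, a more literal translation of COUNTIFS is also possible in base R. The following is a minimal sketch, not part of either answer above; the helper name countifs and the result name out2 are made up for illustration. For each row of statedf it counts the members whose state matches, whose finalweek equals week - 1, and whose orders equals a given value.

```r
statedf <- data.frame(state = c('MD','MD','MD','NY','NY','NY'), week = 5:7,
                      stringsAsFactors = FALSE)
memberdf <- data.frame(memID = 1:15, state = c('MD','MD','NY','NY','MD'),
                       finalweek = c(3,3,5,3,3,5,3,5,3,5,6,5,2,3,5),
                       orders = c(1,2,3), stringsAsFactors = FALSE)

# Hypothetical helper: one COUNTIFS-style count for a single
# state/week/order combination
countifs <- function(state, week, order) {
  sum(memberdf$state == state &
      memberdf$finalweek == week - 1 &
      memberdf$orders == order)
}

# Apply the helper to every row of statedf, once per order value
count_for <- function(order) {
  mapply(countifs, statedf$state, statedf$week,
         MoreArgs = list(order = order), USE.NAMES = FALSE)
}

out2 <- statedf
out2$oneorder   <- count_for(1)
out2$twoorder   <- count_for(2)
out2$threeorder <- count_for(3)
out2
```

Because this loops over the rows of statedf, it is likely slower than the join-based answers on large data, but it keeps every state/week row of statedf by construction.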

Related

How to count the occurrence of a word in multiple variables in R and sort them from highest to lowest?

I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sorting them from highest to lowest in a dataframe. Here's some reproducible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
DIAG_SECUN = c("O555", "O530", "O890"),
DIAGSEC1 = c("O766", "O876", "O899"),
DIAGSEC2 = c("O200", "I520", "O200"),
DIAGSEC3 = c("O233", "O200", "O620"),
DIAGSEC4 = c("O060", "O061", "O622"),
DIAGSEC5 = c("O540", "O123", "O344"),
DIAGSEC6 = c("O876", "Y321", "S333"),
DIAGSEC7 = c("O450", "X900", "O541"),
DIAGSEC8 = c("O222", "O111", "O123"),
DIAGSEC9 = c("O987", "O123", "O622"),
CID_MORTE = c("O066", "O699", "O555"),
CID_ASSO = c("O600", "O060", "O068"),
CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
"O01", "O010", "O011", "O019",
"O02", "O020", "O021", "O028", "O029",
"O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
"O038", "O039",
"O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047",
"O048", "O049",
"O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057",
"O058", "O059",
"O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067",
"O068", "O069",
"O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077",
"O078", "O079",
"O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087",
"O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column from the DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, in my reproducible data the ICD-10 code "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to subset the values from 'GRUPO1' and get the frequency count with table in base R
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
Here is a tidyverse solution using tidyr and dplyr:
library(tidyverse)
pivot_longer(data, everything()) %>%
filter(value %in% GRUPO1) %>%
count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1
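Since the real dataset has 108 columns and only 14 are of interest, it may help to restrict the count to those columns and return a sorted data frame. The following base R sketch makes two assumptions for illustration: GRUPO1 is abbreviated to its "O06" entries, and OTHER_COL is a made-up column standing in for the columns that should be ignored.

```r
data <- data.frame(
  DIAG_PRINC = c("O200", "O200", "O230"),
  DIAG_SECUN = c("O555", "O530", "O890"),
  DIAGSEC1 = c("O766", "O876", "O899"),
  DIAGSEC2 = c("O200", "I520", "O200"),
  DIAGSEC3 = c("O233", "O200", "O620"),
  DIAGSEC4 = c("O060", "O061", "O622"),
  DIAGSEC5 = c("O540", "O123", "O344"),
  DIAGSEC6 = c("O876", "Y321", "S333"),
  DIAGSEC7 = c("O450", "X900", "O541"),
  DIAGSEC8 = c("O222", "O111", "O123"),
  DIAGSEC9 = c("O987", "O123", "O622"),
  CID_MORTE = c("O066", "O699", "O555"),
  CID_ASSO = c("O600", "O060", "O068"),
  CID_NOTIF = c("O069", "O066", "O065"),
  OTHER_COL = c("O060", "O060", "O060"),  # hypothetical column to be ignored
  stringsAsFactors = FALSE
)

# Abbreviated code list for this example only (a subset of the full GRUPO1)
GRUPO1 <- c("O06", paste0("O06", 0:9))

# The 14 columns of interest, selected by name
cols <- c("DIAG_PRINC", "DIAG_SECUN", paste0("DIAGSEC", 1:9),
          "CID_ASSO", "CID_MORTE", "CID_NOTIF")

# Flatten the selected columns, keep matching codes, count and sort
v <- unlist(data[cols], use.names = FALSE)
tab <- sort(table(v[v %in% GRUPO1]), decreasing = TRUE)
counts <- data.frame(code = names(tab), n = as.integer(tab))
counts
```

Selecting by name avoids depending on column positions, which the question says differ across the dataset.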

R fastdummies equivalent in sparkR

I have a Spark dataframe with the following data:
categories
1 John
2 Luis
3 Dora
For which I need to create a one-hot encoded version as:
categories categories_Dora categories_John categories_Luis
1 John 0 1 0
2 Luis 0 0 1
3 Dora 1 0 0
This is the current code I have:
test <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John;Luis","Dora"))
df <- as.DataFrame(test)
df_2 = selectExpr(df, "split(Name, ';') AS categories","Name")
dat <- df_2 %>%
mutate(categories=explode(df_2$categories)) %>%
select("categories")
The current solution I have is to convert this to a regular R data frame and apply the fastDummies dummy_cols() function. This works for this case, but it won't work properly for a large dataset:
r_df = dat %>%
SparkR::collect()
dummy_r = dummy_cols(r_df)
How can I get the same result using sparkR dataframes?
EDIT:
I cannot use sparklyr, only SparkR.
It can be done with sparklyr, which exposes many of the feature transformer functions.
library(sparklyr)
test <- data.frame("categories" = c("John", "Luis","Dora"))
sc <- sparklyr::spark_connect(master = "local")
d_tbl <- copy_to(sc, test, overwrite = TRUE)
d_tbl %>%
ft_string_indexer(input_col = "categories", output_col = "cat_num") %>%
mutate(cat_num = cat_num + 1) %>%
ft_one_hot_encoder("cat_num", "cat_onehot") %>%
sdf_separate_column("cat_onehot",
paste("categories", pull(., categories), sep="_")) %>%
select(-cat_num, -cat_onehot)
The output:
# Source: spark<?> [?? x 4]
categories categories_John categories_Luis categories_Dora
<chr> <dbl> <dbl> <dbl>
1 John 0 0 0
2 Luis 0 1 0
3 Dora 0 0 1
ft_string_indexer generates a column named cat_num, which holds a numeric value for each category, very similar to as.numeric(factor) in R. The +1 is just to have the indexes run from 1 to N. ft_one_hot_encoder does the magic at the Spark level; the function returns a vector-valued column, like a list holding the encoding. sdf_separate_column expands the encoding into columns. The paste generates the column names from the category levels. The select drops the helper columns used in the transformation.

How can I add more columns to a data frame with a for loop

I am a beginner in R. I need to translate some EViews code to R. The EViews code uses loops to add 10 or more columns/variables computed with some function of the data.
Here is example EViews code to estimate deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate function, but it is very tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv*100,
deflator_gov=nominal_gov/real_gov*100,
deflator_exp=nominal_exp/real_exp*100)
print(df)
Please help me do this in R with a loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to the nice dplyr-compatible data.frame.
You need to gather your data using tidyr::gather. However, because you have 2 sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps; there may be a better way, though.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
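The two-step gather and join can likely be collapsed into a single call with tidyr::pivot_longer (tidyr 1.0.0 or later), using the special ".value" name so that the part of each column name before the underscore becomes its own column. This is a sketch with made-up numbers rather than the rnorm data above, and it computes the deflator as nominal / real * 100, as in the question.

```r
library(tidyr)

# Small deterministic example in the same wide layout as df
df <- data.frame(obs_id = 1:2,
                 nominal_gdp = c(110, 121),
                 nominal_inv = c(52.5, 63),
                 real_gdp = c(100, 110),
                 real_inv = c(50, 60))

# "nominal_gdp" is split at "_": the first part ("nominal" / "real")
# names the output column, the second part goes into `variable`
df2 <- pivot_longer(df, cols = -obs_id,
                    names_to = c(".value", "variable"),
                    names_sep = "_")

df2$deflator <- df2$nominal / df2$real * 100
df2
```

The result has one row per obs_id and variable, with nominal, real and deflator side by side, so no join on obs_id is needed.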
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
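If a loop that mirrors the EViews code is preferred, a plain for loop over the variable suffixes works in base R as well. This is a minimal sketch with made-up numbers; it assumes every suffix has both a nominal_ and a real_ column.

```r
df <- data.frame(nominal_gdp = c(110, 121),
                 nominal_inv = c(52.5, 63),
                 real_gdp = c(100, 110),
                 real_inv = c(50, 60))

# One new deflator_<x> column per suffix, built from the column names
for (x in c("gdp", "inv")) {
  nominal <- df[[paste0("nominal_", x)]]
  real    <- df[[paste0("real_", x)]]
  df[[paste0("deflator_", x)]] <- nominal / real * 100
}
df
```

The [[ ]] operator accepts a string, which is what makes it possible to build the column names inside the loop.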
We can loop over column names using purrr::map_dfc then apply a custom function over the selected columns (i.e. the columns that matched the current name from nms)
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list, not a data frame
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: quo_name, !!, and ensym are tools for programming with dplyr; it is worth reading up on them.

Building dummy variable with many conditions (R)

My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT 96 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1979 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong, I changed your code a bit just to add another column to the long format data table. Here is what I did in the end:
df %>%
melt(id = c("ID", "DOB")) %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As @Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
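The same idea can be extended to every birth year and every ATTxx column without reshaping. The sketch below is one possible generalization, not the answerer's code: it assumes the birth-year column is called YOB (as in the question's table), that each ATTxx column is a 0/1 flag, and that the names att_years and hits are free to use.

```r
df <- data.frame(ID = 1:4,
                 YOB = c(1975, 1977, 1979, 1985),
                 ATT94 = c(1, 0, 1, 1),
                 ATT96 = c(0, 1, 0, 1),
                 ATT98 = c(0, 0, 0, 1))

# Map each attendance column to the calendar year it refers to
att_years <- c(ATT94 = 1994, ATT96 = 1996, ATT98 = 1998)

# One logical column per ATTxx: attending that year AND aged 19 or 20
hits <- sapply(names(att_years), function(col) {
  age <- att_years[[col]] - df$YOB
  df[[col]] %in% 1 & age %in% c(19, 20)
})

# 1 if the condition holds in any year, otherwise 0 (never NA)
df$dummy <- as.integer(rowSums(hits) > 0)
df
```

Using %in% 1 rather than == 1 treats a missing attendance value as "not attending" instead of propagating NA.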
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately, the reshape2 package converts wide data to long with melt(). The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
df %>%
melt(id = "DOB") %>%
tbl_df() %>%
mutate(dummy = ifelse((value - DOB) %in% c(19,20), 1, 0))
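A note on operator precedence, since it explains the behaviour described in the question's edit: in R, %in% binds more tightly than -, so value - DOB %in% c(19, 20) is parsed as value - (DOB %in% c(19, 20)). With birth years, the %in% part is always FALSE, the expression reduces to value, and ifelse() treats any non-zero number as TRUE. A small illustration with made-up numbers:

```r
DOB   <- c(1975, 1960, 1976)
value <- c(1994, 1996, 0)   # attendance recoded to the year, 0 if absent

# Without parentheses: value - (DOB %in% c(19, 20)), i.e. just `value`
wrong <- ifelse(value - DOB %in% c(19, 20), 1, 0)

# With parentheses: the age is computed first, then tested
right <- ifelse((value - DOB) %in% c(19, 20), 1, 0)

wrong  # 1 1 0
right  # 1 0 0
```

The second person is 36 in 1996, so only the parenthesized version gives the intended 0.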
@Warner shows a way to create the variable (or at least the 1's; the assumption is that the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)

Create new index / re-index in dplyr [duplicate]

I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation that throws out a bunch of data in some preliminary steps.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; the group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting it to a data frame, but I was curious whether it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments I have some fake data. I purposefully made the group and date values not be 1 to whatever, but I didn't make the date an actual date. I made the data unbalanced, removing some values to illustrate the issue. The dplyr command re-starts the index at 1 for each new group, regardless of what date_val it is. So every group starts at 1, even if the date is different.
df1 <- data.frame(id = 1:40,
group_id = (10 + rep(1:10, each = 4)),
date_val = (20 + rep(rep(1:4), 10)),
val = runif(40))
for (i in c(5, 17, 33))
{
df1 <- df1[!df1$id == i, ]
}
df_new <- df1 %>%
group_by(group_id) %>%
arrange(date_val) %>%
mutate(date_val_new=row_number(group_id)) %>%
ungroup()
This uses base R's match() inside mutate():
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
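One detail worth checking with match(): unique() keeps values in order of first appearance, so the new index follows that order rather than the sorted order of the dates. If the index should increase with the date, sorting the unique values first (or going through factor()) is one option. A small base R sketch with made-up values:

```r
date_val <- c(23, 21, 24, 21, 22, 23)

# Index by order of first appearance
by_appearance <- match(date_val, unique(date_val))

# Index by sorted value, so a larger date always gets a larger index
by_sorted <- match(date_val, sort(unique(date_val)))

# factor() sorts its levels, so this gives the same result as by_sorted
via_factor <- as.integer(factor(date_val))

by_appearance  # 1 2 3 2 4 1
by_sorted      # 3 1 4 1 2 3
```

In the example data from the question the dates first appear in increasing order, so both variants agree there.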
Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank()
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql()
tbl(my_db, sql("SELECT *,
DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
FROM df1"))
Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
# id group_id date_val val .id
# 1: 1 11 21 0.01857242 1
# 2: 2 11 22 0.57124557 2
# 3: 3 11 23 0.54318903 3
# 4: 4 11 24 0.59555088 4
# 5: 6 12 22 0.63045007 1
# 6: 7 12 23 0.74571297 2
# 7: 8 12 24 0.88215668 3
