How can I convert GroupedData into a DataFrame in R?

Consider I have the below dataframe
AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12
I want to group it by AccountId and then add another column, date_diff, containing the difference in CloseDate between the current row and the previous row. Note that I want date_diff to be calculated only between rows having the same AccountId, so I need to group the data before adding the column.
Below is the R code that I am using
df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
df$CloseDate <- to_date(df$CloseDate)
groupedData <- SparkR::group_by(df, df$AccountId)
SparkR::mutate(groupedData, DiffCloseDt = as.numeric(SparkR::datediff((CloseDate),(SparkR::lag(CloseDate,1)))))
To add another column I am using mutate. But since group_by returns GroupedData, I am not able to use mutate here, and I get the error below:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’
So how can I convert GroupedData into Dataframe so that I can add columns using mutate?

What you want is not possible to achieve using group_by. As already explained quite a few times on SO:
Using groupBy in Spark and getting back to a DataFrame
How to do custom operations on GroupedData in Spark?
DataFrame groupBy behaviour/optimization
group_by on a DataFrame doesn't physically group the data. Moreover, the order of operations after applying group_by is nondeterministic.
To achieve desired output you'll have to use window functions and provide an explicit ordering:
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L,
3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L,
5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02",
"2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")),
.Names = c("AccountId", "CloseDate"),
class = "data.frame", row.names = c(NA, -12L))
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")
query <- "SELECT *, LAG(CloseDate, 1) OVER (
PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"
dfWithLag <- sql(hiveContext, query)
withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
head()
## AccountId CloseDate DateLag diff
## 1 1 2015-05-07 <NA> NA
## 2 1 2015-05-09 2015-05-07 2
## 3 1 2015-05-12 2015-05-09 3
## 4 1 2015-05-12 2015-05-12 0
## 5 2 2015-05-09 <NA> NA
## 6 2 2015-05-12 2015-05-09 3
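As a side note, on SparkR 2.0 or later the same window can be expressed without dropping to SQL, via windowPartitionBy/orderBy/over. A hedged sketch under that version assumption:
ws <- orderBy(windowPartitionBy("AccountId"), "CloseDate")
# Lag CloseDate within each AccountId partition, ordered by CloseDate
dfWithLag <- withColumn(sdf, "DateLag", over(lag(sdf$CloseDate, 1), ws))
withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
  head()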

Related

Creating matched pairs based on condition

Suppose I have a table in the following format:
CowId DIM Type
1 13 Case
2 7 Case
3 3 Control
4 4 Control
5 9 Control
6 3 Control
7 5 Control
8 10 Control
9 1 Control
10 6 Control
11 7 Control
12 4 Control
I would like to randomly match Cases to Controls (1 to 1) based on +/- 3 DIM. Is there a convenient way to accomplish this task using dplyr? Any feedback would be appreciated.
Output from dput is appended:
structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
A way in base R :
# Logical index of rows where Type == 'Case'
inds <- df$Type == 'Case'
# All values within -3 to +3 of each Case DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
# Select random Control rows whose DIM falls in that range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
# Combine Case and Control data
df[c(which(inds), result), ]
# CowId DIM Type
#1 1 13 Case
#2 2 7 Case
#5 5 9 Control
#10 10 6 Control
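For comparison, here is a hedged sketch of a stricter greedy matcher (not from the original answer): it walks the Cases in random order and, for each, samples one still-unused Control within +/- 3 DIM, so every pick is tied to a specific Case.
set.seed(1)  # hypothetical seed, for reproducibility only
cases    <- df[df$Type == "Case", ]
controls <- df[df$Type == "Control", ]
available <- rep(TRUE, nrow(controls))
pick_one <- function(i) {
  # Still-unused Controls within +/- 3 DIM of this Case
  ok <- which(available & abs(controls$DIM - cases$DIM[i]) <= 3)
  if (length(ok) == 0) return(NULL)
  j <- ok[sample.int(length(ok), 1)]
  available[j] <<- FALSE  # exclude this Control from future picks
  data.frame(CaseId = cases$CowId[i], ControlId = controls$CowId[j])
}
do.call(rbind, lapply(sample.int(nrow(cases)), pick_one))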
The "randomly" part could be tricky. Here is my approach:
For each Case ID, calculate the min/max DIM.
Then randomly pick either 1 or half of the Controls available to it.
Record the picked Controls against the Case ID and exclude those rows from future picks.
Repeat until done for all Cases.
If no pick is available, a message pops up.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)
# get the list of Case IDs - assume CowId is unique in df
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
# calculate the min/max DIM
current_case <- df %>% filter(CowId == i_case_id)
expecting_DIM_min <- current_case$DIM - 3
expecting_DIM_max <- current_case$DIM + 3
# Pick with sample
possible_sample <- df %>%
filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
Picked == FALSE)
if (nrow(possible_sample) == 0) {
message("There is no possible sample for Case ID: ", i_case_id)
message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
} else {
max_sample <- nrow(possible_sample)
# Maximum pick - in this case the OP asked for a 1-1 match
# pick_number <- max(1, max_sample / 2)
pick_number <- 1
sample <- possible_sample %>%
sample_n(size = pick_number)
df$Picked[df$CowId %in% sample$CowId] <- TRUE
df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
}
}
Here is the output:
df %>% filter(Picked | Type == "Case")
#> CowId DIM Type Picked Case_ID
#> 1 1 13 Case FALSE -1
#> 2 2 7 Case FALSE -1
#> 3 8 10 Control TRUE 1
#> 4 10 6 Control TRUE 2
Updated: matching 1-1 only
Created on 2021-04-10 by the reprex package (v1.0.0)

Selectively scale a variable in R

Suppose you have the following dataframe named data:
Country V1 V2
US 1 2
US 2 1
US 3 1
UK 1 1
UK 2 1
UK 3 3
...
IT 2 2
Now I want to scale the variables V1 and V2. The first idea would be to use something like:
data %>%
mutate_at(.vars = c("V1", "V2"), .funs = scale)
But, what if I want to perform scaling separately for each value of the Country variable and have the result all in one dataframe?
This is just an example; the actual data, which I am not able to provide, contains a lot of NAs. I am worried that if I use select or some of the other functions, the data won't be joined back properly because of the NAs.
If we want the results as separate data.frames/tibbles, one option is map, storing them in a list:
library(dplyr)
map(c("V1", "V2"), ~ data %>%
select(Country, .x) %>%
group_by(Country)
scale)
Or, if we need everything in a single data frame, do a group_by:
data %>%
group_by(Country) %>%
mutate_at(vars(V1, V2), ~ c(scale(.)))
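On dplyr 1.0+, the same per-group scaling can be written with across(), which supersedes mutate_at(). A sketch under that version assumption:
data %>%
  group_by(Country) %>%
  # c() drops the matrix attributes that scale() attaches
  mutate(across(c(V1, V2), ~ c(scale(.x))))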
Here is a solution with base R (using the data frame df given under DATA below):
r <- Reduce(rbind, lapply(split(df, df$Country), function(x) {
  x[-1] <- scale(x[-1])
  x
}))
res <- r[order(as.numeric(rownames(r))), ]
such that
> res
Country V1 V2
1 US -1 1.1547005
2 US 0 -0.5773503
3 US 1 -0.5773503
4 UK -1 -0.5773503
5 UK 0 -0.5773503
6 UK 1 1.1547005
7 IT NaN NaN
(IT comes out as NaN because scale() on a single observation has an undefined standard deviation.)
DATA
df <- structure(list(Country = structure(c(3L, 3L, 3L, 2L, 2L, 2L,
1L), .Label = c("IT", "UK", "US"), class = "factor"), V1 = c(1L,
2L, 3L, 1L, 2L, 3L, 2L), V2 = c(2L, 1L, 1L, 1L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-7L))

Is there an R function to group a table by a certain variable? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
I am trying to remove some rows of my data by adding them to a different row, in the form of another column. Is there a way I can group rows together by a certain variable?
I have tried using the group_by() function from the dplyr package, but it does not seem to solve my issue.
library(dplyr)
late <- read.csv(file.choose())
late <- group_by(late, state, add = FALSE)
The data set I have (named "late") now is in this form:
ontime state count
0 AL 1
1 AL 44
null AL 3
0 AR 5
1 AR 50
...
But I would like it to be:
state count0 count1 countnull
AL 1 44 3
AR 5 50 null
...
Ultimately, I want to calculate count0/count1 for each state. So if there is a better way of going about this, I would be open to any suggestions.
You could do this with dcast() from the reshape2 package:
library(reshape2)
df = data.frame(
ontime = c(0,1,NA,0,1),
state = c("AL","AL","AL","AR","AR"),
count = c(1,44,3,5,50)
)
dcast(df, state ~ ontime, value.var = "count")
With spread:
library(dplyr)
library(tidyr)
df %>%
mutate(ontime = paste0('count', ontime)) %>%
spread(ontime, count)
Output:
state count0 count1 countnull
1 AL 1 44 3
2 AR 5 50 NA
Data:
df <- structure(list(ontime = structure(c(1L, 2L, 3L, 1L, 2L), .Label = c("0",
"1", "null"), class = "factor"), state = structure(c(1L, 1L,
1L, 2L, 2L), .Label = c("AL", "AR"), class = "factor"), count = c(1L,
44L, 3L, 5L, 50L)), class = "data.frame", row.names = c(NA, -5L
))
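Since the stated end goal is count0/count1 per state, one more mutate on top of the spread result gets there. A sketch using the df above:
library(dplyr)
library(tidyr)
df %>%
  mutate(ontime = paste0('count', ontime)) %>%
  spread(ontime, count) %>%
  mutate(ratio = count0 / count1)
# AL: 1/44 ~ 0.023; AR: 5/50 = 0.1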

Conditional data manipulation using data.table in R

I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to add these missing time values only IF the value of group appears in testx. In this example, I thus only want to add missing time values for groups whose group value appears in testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, which is why I prefer using data.table.
You don't need to refer to testy when you are inside testy[] and grouping with by; using group directly as a variable gives the correct result. You need an extra else branch to return the rows whose group is not in testx if you want to keep all records in testy (this assumes testy is still keyed by time, as after the setDT call above):
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37
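An alternative hedged sketch of the same idea via an explicit join, assuming testy is already a data.table (as after the setDT call above): build the full (group, time) grid only for the groups present in testx, join testy onto it, and append the untouched groups.
library(data.table)
# Full time grid, but only for groups listed in testx
grid <- testy[group %in% testx$group, .(time = min(time):max(time)), by = group]
filled <- testy[grid, on = .(group, time)]   # missing times get value = NA
rbind(filled, testy[!group %in% testx$group])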

Removing duplicate rows with ddply

I have a dataframe df containing two factor variables (Var and Year) as well as one (in reality several) column with values.
df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Year = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L), .Label = c("2000", "2001",
"2002"), class = "factor"), Val = structure(c(1L, 2L, 2L, 4L,
1L, 3L, 3L, 5L, 6L, 6L), .Label = c("2", "3", "4", "5", "8",
"9"), class = "factor")), .Names = c("Var", "Year", "Val"), row.names = c(NA,
-10L), class = "data.frame")
> df
Var Year Val
1 A 2000 2
2 A 2001 3
3 A 2002 3
4 B 2000 5
5 B 2001 2
6 B 2002 4
7 B 2002 4
8 C 2000 8
9 C 2001 9
10 C 2002 9
Now I'd like to find rows with the same value for Val for each Var and Year and only keep one of those. So in this example I would like row 7 to be removed.
I've tried to find a solution with plyr using something like
df_new <- ddply(df, .(Var, Year), summarise, !duplicate(Val))
but obviously that is not a function accepted by ddply.
I found this similar question but the plyr solution by Arun only gives me a dataframe with 0 rows and 0 columns and I do not understand the answer well enough to modify it according to my needs.
Any hints on how to go about that?
Non-duplicates of Val by Var and Year are the same as non-duplicates of Val, Var, and Year. You can specify several columns for duplicated (or the whole data frame).
I think this does what you'd like.
df[!duplicated(df), ]
Or:
df[!duplicated(df[, c("Var", "Year", "Val")]), ]
You can just use the unique() function instead of !duplicated(Val):
df_new <- ddply(df, .(Var, Year), summarise, Val=unique(Val))
# or
df_new <- ddply(df, .(Var, Year), function(x) x[!duplicated(x$Val),])
# or if you only have these 3 columns:
df_new <- ddply(df, .(Var, Year), unique)
# with dplyr
df %>% group_by(Var, Year) %>% filter(!duplicated(Val))
hth
You don't need the plyr package here. If your whole dataset consists of only these 3 columns and you need to remove the duplicates, then you can use,
df_new <- unique(df)
Otherwise, if you just need to pick the first observation per group-by variable list, you can use the method suggested by Richard. That's usually how I have been doing it.
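For completeness, modern dplyr also has distinct(), which covers both the whole-row and selected-column cases. A sketch using the same df:
library(dplyr)
distinct(df)                  # drop exact duplicate rows
distinct(df, Var, Year, Val)  # duplicates judged on these columns only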
