I've got a large dataset (millions of records) with this structure:
id | ident1 | ident2
1 A000001 B000001
2 A000001 B000002
................
99 A000001 B000099
.........
337 A000002 B000037
338 A000002 B000043
In other words, for each [ident1], I have a high number of entries in [ident2]. I'd like to be able to select only 20 of these entries (or all of them, if there are fewer than 20).
Order is not important: so if a given ident1 has 100 matching [ident2], I'd like either the first 20 entries, or 20 random ones, it doesn't matter.
Thanks in advance, p.
Try
library(dplyr)
df %>%
  group_by(ident1) %>%
  slice(1:20)
Or using data.table
library(data.table)
setDT(df)[, head(.SD,20), by=ident1]
If you need a random sample:
setDT(df)[df[, .I[sample(.N, 20, replace = FALSE)], by = ident1]$V1]
If some of the groups have fewer than 20 rows to sample:
setDT(df)[, if (.N < 20) .SD else .SD[sample(.N, 20, replace = FALSE)], by = ident1]
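A more compact variant along the same lines (a sketch; min(.N, 20) caps the sample at the group size, so small groups don't error):
# sample at most 20 rows per ident1
setDT(df)[, .SD[sample(.N, min(.N, 20))], by = ident1]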
Like akrun's answer, I use dplyr, but in this case the observations are selected randomly.
library(dplyr)
df %>%
  group_by(ident1) %>%
  sample_n(20)
or:
library(dplyr)
df %>%
  group_by(ident1) %>%
  sample_frac(.2) # randomly select 20% from each group
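Note that sample_n(20) errors on groups with fewer than 20 rows. Current dplyr (>= 1.0.0) has slice_sample(), which silently truncates to the group size, matching the "or all of them, if fewer than 20" requirement; a sketch:
library(dplyr)
df %>%
  group_by(ident1) %>%
  slice_sample(n = 20) %>% # groups smaller than 20 return all their rows
  ungroup()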
Using plyr:
random selection of observations:
ddply(df, .(ident1), function(x, howmany) {
  x[sample(seq_len(nrow(x)), howmany), ]
}, howmany = 20)
selecting the first 20 obs:
ddply(df, .(ident1), head, 20)
A base R option to get the first 20 rows per ident1, though not as efficient as data.table or dplyr, would be:
df[ave(seq_along(df$ident1), df$ident1, FUN = seq_along) <= 20, ]
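A base R analogue for a random sample per group (a sketch, using the same min() trick to handle groups smaller than 20):
# split by ident1, sample at most 20 rows from each piece, recombine
picked <- lapply(split(df, df$ident1), function(g) {
  g[sample(nrow(g), min(nrow(g), 20)), ]
})
do.call(rbind, picked)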
Related
I'm trying to subset my total data (including all the other variables) to an interval of zipcodes EXCLUDING a certain part of that interval. Quite new to R and can't get it to work. (Zipcode = postnr.)
I have over 100 000 zipcodes (postnr) and want all values for individuals in zipcodes 10 000-12 999 and 15 600-16 800 in my dataset.
Attempt 1
Datan <- subset(Data2, Data2$postnr >= 10000 & Data2$postnr <= 16880)
Datant <- subset(Datan, Datan$postnr >= 15600 & Datan$postnr < 13000)
Datan returns 31 3000 obs. of 26 variables and Datant returns 0 obs. of 26 variables.
Attempt 2
attach(Data2)
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) & between(postnr, 15600, 16880))
Data5 returns 0 observations...
I have thousands of values for all my variables inside those intervals. What am I doing wrong?
If you think about and versus or, you've got it. As it is, you're really close!
Can a number be between 1 and 2 and between 3 and 5? Nope. But can a number be between 1 and 2 or between 3 and 5? Yup.
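A toy illustration with made-up numbers, just to show the logic:
x <- c(1.5, 2.5, 4, 6)
x > 1 & x < 2 & x > 3 & x < 5     # "and": impossible, all FALSE
(x > 1 & x < 2) | (x > 3 & x < 5) # "or": TRUE FALSE TRUE FALSE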
Updated
For subset:
Datan <- subset(Data2, postnr >= 10000 & postnr <= 12999 |
                  postnr >= 15600 & postnr <= 16800)
Where that vertical pipe (|) means 'or'.
For dplyr:
(I assume it's dplyr with filter.) You don't need to attach the data; filter will look up the variable names in Data2 since it's in the pipe (which it is).
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) |
                            between(postnr, 15600, 16880))
I have no data, so I cannot properly test this, but the following should work.
Note the or operator (|) to specify two different conditions.
library(data.table)
dt <- as.data.table(Data2)
dt[(postnr >= 10000 & postnr <= 12999) | (postnr >= 15600 & postnr <= 16880), ]
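data.table also has a %between% operator (inclusive on both ends), so an equivalent, arguably more readable sketch is:
dt[postnr %between% c(10000, 12999) | postnr %between% c(15600, 16880)]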
I am new to R and am running into difficulty with more advanced filtering. I have a data frame containing 1500 rows of people in households and need to filter out everyone who is part of a household where at least 1 person is older than 24. For example, in the sample set below I would only want to keep rows 3,4, and 5.
PersonalID DOB HouseholdID
1 1961-04-15 123
2 2017-01-12 123
3 2000-01-02 122
4 2001-03-05 122
5 1996-08-22 122
Initially I just filtered to get a new data frame with everyone in that age range and then filtered the original data frame again (and again and again and so on...) with each HouseholdID of someone under 25 to check if anyone else with that HouseholdID is over 24.
Whenever I'm doing the same thing over and over it seems like there's probably a way to use a function instead but I'm having a hard time coming up with one that works. This is my current attempt but I know there's plenty wrong with it:
UNDER25df <- filter(df, DOB >= "yyyy-mm-dd")
for (UNDER25df$HouseholdID in df) {
if (all(df$DOB >= "yyyy-mm-dd")) {
view(filter(df, HouseholdID == "$HouseholdID"))
}
}
The error I get is:
unexpected '}' in "}"
but I'm pretty sure that I can nest an if statement in a for loop in R, and I've been careful about the positioning of the brackets, so I don't know exactly what it's referring to.
What I'm not sure of is whether I can iterate through a data frame in this way, or if this even makes sense. I've read that vectorizing might be better in general for advanced filtering, but I tried to read the documentation on it and couldn't see how to make the jump to this problem. Does anyone have a suggestion or a direction I should be looking in?
You do not need a loop for this. Try
library(lubridate)
library(dplyr)
set.seed(1)
df <- tibble(DOB = Sys.Date() - sample(3000:12000, 6),
             personalID = 1:6,
             HouseholdID = c(1, 1, 2, 2, 2, 3))
df$DOB
# grab householdID from all persons that are at least 24
oldies <- df[time_length(today() - as.Date(df$DOB), "years") > 24,
             "HouseholdID", TRUE]
# base R way
oldies <- df[as.Date(df$DOB) < as.Date("1993-02-10"),
             "HouseholdID", TRUE]
# household members in a household with someone 24 or older
df %>%
  filter(HouseholdID %in% oldies)
# household members in a household with no one 24 or older
df %>%
  filter(!(HouseholdID %in% oldies))
I am not sure if you want to keep the rows, grouped by ID, where all members are 24 or younger. If so, then maybe you can try the code below.
library(lubridate)
dfout <- subset(df, ave(floor(time_length(Sys.Date() - as.Date(DOB), "years")) <= 24,
                        HouseholdID, FUN = all))
If you really want to use a for loop to do it, then below is an example:
dfout <- data.frame()
for (id in unique(df$HouseholdID)) {
  subdf <- subset(df, HouseholdID == id)
  if (with(subdf, all(floor(time_length(Sys.Date() - as.Date(DOB), "years")) <= 24))) {
    dfout <- rbind(dfout, subdf)
  }
}
Both approaches above can give you the result shown as
> dfout
PersonalID DOB HouseholdID
3 3 2000-01-02 122
4 4 2001-03-05 122
5 5 1996-08-22 122
DATA
df <- structure(list(PersonalID = 1:5, DOB = c("1961-04-15", "2017-01-12",
"2000-01-02", "2001-03-05", "1996-08-22"), HouseholdID = c(123L,
123L, 122L, 122L, 122L)), class = "data.frame", row.names = c(NA,
-5L))
I am not sure if you want to select households where all the people are above 24 or at least one person is above 24. In any case, you can use subset with ave (note that DOB in the example data is character, so it needs as.Date first):
subset(df, ave(as.integer(format(Sys.Date(), "%Y")) -
                 as.integer(format(as.Date(DOB), "%Y")) >= 24,
               HouseholdID, FUN = any))
This selects households where at least one person is above 24. If you want to select households where all people are above 24 use all instead of any in FUN argument.
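For completeness, the all variant (same idiom, same crude year-difference age) would look like:
# households where every member is at least 24 (by year difference)
subset(df, ave(as.integer(format(Sys.Date(), "%Y")) -
                 as.integer(format(as.Date(DOB), "%Y")) >= 24,
               HouseholdID, FUN = all))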
Similarly, using dplyr, we can use
library(dplyr)
df %>%
  group_by(HouseholdID) %>%
  filter(any(as.integer(format(Sys.Date(), "%Y")) -
               as.integer(format(as.Date(DOB), "%Y")) >= 24))
I have a df like this:
> df<-data.frame(Client.code =
c(100451,100451,100523,100523,100523,100525),dayref = c(24,30,15,13,17,5))
> df
Client.code dayref
1 100451 24
2 100451 30
3 100523 15
4 100523 13
5 100523 17
6 100525 5
It is a one-year distribution of payment periods from issue.
Using the data above and given a df2 like this:
Client.Code Days
1 100451 16
1 100523 16
1 100460 35
As I have enough data for reasonable quantile probability calculations, I'd like to know how to build a loop that assigns to every row of days in this df2 a quantile according to the first df.
We can use data.table
library(data.table)
setDT(df)[, .(Quantile = quantile(dayref)), Client.code]
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
  group_by(Client.code) %>%
  summarise(Quantile = list(quantile(dayref))) %>%
  unnest(cols = Quantile)
tapply(df$dayref, df$Client.code, quantile)
You can specify particular percentiles by adding a vector of them:
tapply(df$dayref, df$Client.code, quantile, 1:19/20)
Or, more explicitly, name the argument:
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20)
And you can add na.rm = TRUE as another argument if you might have NAs.
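For example:
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20, na.rm = TRUE)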
I'm at the last stage of cleaning/organizing data and would appreciate suggestions for this step. I'm new to R and don't understand fully how dataframes or other data types work. (I'm trying to learn but have a project due so need a quick solution). I've imported the data from a CSV file.
I want to group instances with the same (date, ID1, ID2, ID3). I want the average of all stats in the output and also a new column with the number of instances grouped.
Note: ID3 contains <NA> values. I'd like to rename these to "na" before grouping.
I've tried these solutions:
tdata$ID3[is.na(tdata$ID3)] <- "NA"
tdata[["ID3"]][is.na(tdata[["ID3"]])] <- "NA"
But I get this error:
In `[<-.factor`(`*tmp*`, is.na(tdata[["ID3"]]), value = c(3L, 3L, :
invalid factor level, NA generated
The data is:
date ID1 ID2 ID3 stat1 stat2 stat.3
1 12-03-07 abc123 wxy456 pqr123 10 20 30
2 12-03-07 abc123 wxy456 pqr123 20 40 60
3 10-04-07 bcd456 wxy456 hgf356 10 20 40
4 12-03-07 abc123 wxy456 pqr123 30 60 90
5 5-09-07 spa234 int345 <NA> 40 50 70
Desired Output
date      ID1     ID2     ID3     n  stat1  stat2  stat3
12-03-07  abc123  wxy456  pqr123  3     20     40     60
10-04-07  bcd456  wxy456  hgf356  1     10     20     40
05-09-07  spa234  int345  big234  1     40     50     70
I tried this solution: How to merge multiple data.frames and sum and average columns at the same time in R
But I was not successful merging the columns which have to be grouped and tested for similarity.
DF <- merge(tdata$date, tdata$ID1, tdata$ID2, tdata$ID3, by = "Name", all = T)
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
Finally, to generate the n column. Perhaps insert a column of 1s and use the sum of that column while summarizing?
We can do this with dplyr. After grouping by the 'ID' columns, add 'date' and 'n' also in the grouping variables, and get the mean of 'stat' columns
library(dplyr)
df1 %>%
  group_by(ID1, ID2, ID3) %>%
  group_by(date = first(date), n = n(), add = TRUE) %>%
  summarise_at(vars(matches("stat")), mean)
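With current dplyr (>= 1.0.0), the same summary can be sketched with across(), using .groups = "drop" in place of the implicit regrouping:
library(dplyr)
df1 %>%
  group_by(date, ID1, ID2, ID3) %>%
  summarise(n = n(), across(starts_with("stat"), mean), .groups = "drop")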
NOTE: Regarding changing the NA to 'big234', we can convert the 'ID3' column to character class and change it before doing the above operation:
df1$ID3 <- as.character(df1$ID3)
df1$ID3[is.na(df1$ID3)] <- "big234"
While I find the dplyr solution proposed by akrun very intuitive to use, there is also a nice data.table solution:
Like akrun, I assume that the NA value has been converted to "big234" to get the desired result.
library(data.table)
# convert data.frame to data.table
data <- data.table(df1)
# return the desired output
data[, c(.N, lapply(.SD, mean)),
     by = list(date, ID1, ID2, ID3)]
I am computing a dplyr::summarize across a dataframe of sales data.
I do a group_by (S,D,Y), then within each group compute medians and means of X for weeks 5..43, then merge those back into the parent df. Variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for that S,D,Y and set of weeks, there will simply be no row with those values in df (take that to mean zero sales for that particular set of parameters). In other words, impute X=0 in any structurally missing rows (though I hope I don't need to melt/cast the original df, to avoid bloat; similar to cast(fill..., add.missing=T) or caret::preProcess()).
Two questions about my code idiom:
Is it better to use summarize than dplyr::filter, because filter physically drops rows so I have to assign the results to df.tmp then left-join it back to the original df (as below)? Also, big subsetting expressions repeated on every single line of summarize computations make the code harder to read.
Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing say n=20 new summary variables?
Not all combinations of S,D,Y-groups and the filter (for those weeks) have rows, so how do I get the summarize to replace NA on any missing rows? Currently I do it as below.
Sorry both the code and dataset are proprietary, but here's the code idiom, and below is code you should run first to generate sample-data:
# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %>% group_by(S, D, Y) %>% filter(Week >= 5 & Week <= 43) %>%
  summarize(ysd_med543_X = median(X),
            ysd_mean543_X = mean(X)) %>%
  ungroup()
# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)
and run this first to generate sample-data:
set.seed(1234)
rep_vector <- function(vv, n) {
  unlist(as.vector(lapply(vv, function(...) rep(..., n))))
}
n=7
m=3
df = data.frame(S = rep_vector(10:12, n), D = 20:26,
                Y = rep_vector(2005:2007, n),
                Week = round(52 * runif(m * n)),
                X = 4e4 * runif(m * n) + 1e4)
# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]
require(dplyr)
I don't think this has anything to do with the feature you've linked under comments (because IIUC that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) be including them in the results (with the exception of factors). You should clarify this with the developers on their project page.
I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first followed by group_by + summarise. Else, you'll be filtering for each group, which is unnecessary. That is:
df.tmp <- df %>% filter(Week >= 5 & Week <= 43) %>% group_by(S, D, Y) %>% ...
This is just so that you're aware of it for any future cases.
IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:
df.tmp <- df %>% group_by(S, D, Y) %>% mutate(
  md_X = median(X[Week >= 5 & Week <= 43]),
  mn_X = mean(X[Week >= 5 & Week <= 43]))
Here, we still have the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here, so you'll have to use ifelse, once again IIUC. But this would be a little nicer if mutate supported expressions.
What I've in mind is something like:
df.tmp <- df %>% group_by(S, D, Y) %>% mutate(
  { tmp = Week >= 5 & Week <= 43;
    md_X = ifelse(length(tmp), median(X[tmp]), 0),
    mn_X = ifelse(length(tmp), mean(X[tmp]), 0)
  })
So, we'll probably have to work around it in this manner:
df.tmp = df %>% group_by(S, D, Y) %>% mutate(tmp = Week >= 5 & Week <= 43)
df.tmp %>% mutate(md_X = ifelse(tmp[1L], median(X), 0),
                  mn_X = ifelse(tmp[1L], mean(X), 0))
Or to put things together:
df %>% group_by(S, D, Y) %>% mutate(tmp = Week >= 5 & Week <= 43,
                                    md_X = ifelse(tmp[1L], median(X), 0),
                                    mn_X = ifelse(tmp[1L], mean(X), 0))
# S D Y Week X tmp md_X mn_X
# 1 10 20 2005 6 22107.73 TRUE 22107.73 22107.73
# 2 10 23 2005 32 18751.98 TRUE 18751.98 18751.98
# 3 10 25 2005 33 31027.90 TRUE 31027.90 31027.90
# 4 10 26 2005 0 46586.33 FALSE 0.00 0.00
# 5 11 20 2006 12 43253.80 TRUE 43253.80 43253.80
# 6 11 22 2006 27 28243.66 TRUE 28243.66 28243.66
# 7 11 23 2006 36 20607.47 TRUE 20607.47 20607.47
# 8 11 24 2006 28 22186.89 TRUE 22186.89 22186.89
# 9 11 25 2006 15 30292.27 TRUE 30292.27 30292.27
# 10 12 20 2007 15 40386.83 TRUE 40386.83 40386.83
# 11 12 21 2007 44 18049.92 FALSE 0.00 0.00
# 12 12 26 2007 16 35856.24 TRUE 35856.24 35856.24
which doesn't require df.tmp.
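For reference, with current dplyr the whole thing can be written in one pass, subsetting inside each aggregate and falling back to 0 when no weeks match (a sketch, not tested against the original proprietary data):
library(dplyr)
df %>%
  group_by(S, D, Y) %>%
  mutate(in_win = Week >= 5 & Week <= 43,
         md_X = if (any(in_win)) median(X[in_win]) else 0,
         mn_X = if (any(in_win)) mean(X[in_win]) else 0) %>%
  ungroup()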
HTH