Create new index / re-index in dplyr [duplicate] - r

This question already has answers here:
How to number/label data-table by group-number from group_by?
I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation in preliminary steps that throws out a bunch of the data.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers running from 1 to the total number of distinct elements (about 750 for group_id and about 250 for date_id; group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious whether it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments, I have added some fake data. I purposely made the group and date values not start at 1, though I didn't make the date an actual date. I also made the data unbalanced by removing some rows, to illustrate the issue: the dplyr command below restarts the index at 1 for each group, regardless of the date_val, so every group starts at 1 even when the dates differ.
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10),
                  val = runif(40))
for (i in c(5, 17, 33)) {
  df1 <- df1[!df1$id == i, ]
}
df_new <- df1 %>%
  group_by(group_id) %>%
  arrange(date_val) %>%
  mutate(date_val_new = row_number(group_id)) %>%
  ungroup()

This uses base R's match() inside a mutate(), numbering each date by order of first appearance:
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with data.table (after converting with setDT()): df1[, date_val_new := .GRP, by = date_val].
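A runnable sketch of that data.table route, assuming the df1 built above:
library(data.table)
setDT(df1)                                  # convert to a data.table by reference
df1[, date_val_new := .GRP, by = date_val]  # .GRP numbers the groups 1, 2, ...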

Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
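In current dplyr releases group_indices_() is deprecated; cur_group_id() inside a grouped mutate produces the same kind of per-group index. A sketch, assuming the df1 created above:
df1 %>%
  group_by(date_val) %>%
  mutate(date_val_new = cur_group_id()) %>%
  ungroup()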
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank():
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
  mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql():
tbl(my_db, sql("SELECT *,
                DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
                FROM df1"))

Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
#    id group_id date_val        val .id
# 1:  1       11       21 0.01857242   1
# 2:  2       11       22 0.57124557   2
# 3:  3       11       23 0.54318903   3
# 4:  4       11       24 0.59555088   4
# 5:  6       12       22 0.63045007   1
# 6:  7       12       23 0.74571297   2
# 7:  8       12       24 0.88215668   3

Related

Recode values based on look up table with dplyr (R)

A relatively trivial question that has been bothering me for a while, but to which I have not yet found an answer - perhaps because I have trouble verbalizing the problem for search engines.
Here is a column of a data frame that contains identifiers.
data <- data.frame("id" = c("D78", "L30", "F02", "A23", "B45", "T01", "Q38", "S30", "K84", "O04", "P12", "Z33"))
Based on a lookup table, outdated identifiers are to be recoded into new ones. Here is an example lookup table:
recode_table <- data.frame("old" = c("A23", "B45", "K84", "Z33"),
                           "new" = c("A24", "B46", "K88", "Z33"))
What I need can be done with a merge or a loop. Here is a loop example:
for (ID in recode_table$old) {
  data[data$id == ID, "id"] <- recode_table[recode_table$old == ID, "new"]
}
But I am looking for a dplyr solution that does not use the "join" family. I would like something like this:
data <- mutate(data, id = ifelse(id %in% recode_table$old, filter(recode_table, old == id) %>% pull(new), id))
Obviously, though, I can't use the column name ("id") of the table to identify the new ID.
References to corresponding passages in documentations or manuals are also appreciated. Thanks in advance!
You can use recode() with unquote splicing (!!!) on a named vector:
library(dplyr)
# vector of new IDs
recode_vec <- recode_table$new
# named with old IDs
names(recode_vec) <- recode_table$old
data %>%
  mutate(id = recode(id, !!!recode_vec))
# id
# 1 D78
# 2 L30
# 3 F02
# 4 A24
# 5 B46
# 6 T01
# 7 Q38
# 8 S30
# 9 K88
# 10 O04
# 11 P12
# 12 Z33
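The named vector can also be built in one step with base R's setNames(), for example:
recode_vec <- setNames(recode_table$new, recode_table$old)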

R - Removing the same name in two columns of a data frame

I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but first I need to remove duplicated records.
When a creature's name appears in the name column, I need to keep that record but remove the record where the same name appears as the spouse. I have the following data sample:
name spouse
15 Finarfin Eärwen
6 Tar-Vanimeldë Herucalmo
17 Faramir Éowyn
8 Tar-Meneldur Almarian
14 Finduilas of Dol Amroth Denethor II
12 Finwë Míriel Serindë then, Indis
9 Tar-Ancalimë Hallacar
7 Tar-Míriel Ar-Pharazôn
5 Tarannon Falastur Berúthiel
21 Rufus Burrows Asphodel Brandybuck
2 Angrod Eldalótë
4 Ar-Gimilzôr Inzilbêth
19 Lobelia Sackville-Baggins Otho Sackville-Baggins
25 Mrs. Proudfoot Odo Proudfoot
22 Rudigar Bolger Belba Baggins
24 Odo Proudfoot Mrs. Proudfoot
3 Ar-Pharazôn Tar-Míriel
13 Fingolfin Anairë
18 Silmariën Elatan
23 Rowan Greenhand Belba Baggins
20 Rían Huor
1 Adanel Belemir
16 Fastolph Bolger Pansy Baggins
10 Morwen Steelsheen Thengel
11 Tar-Aldarion Erendis
25 Belemir Adanel
For example, when I ran the code, row 1 has name Adanel with spouse Belemir, so I need to keep row 1 but remove the later row with Belemir / Adanel, the same pair reversed; that way I avoid duplicated data.
I have tried this following code:
interacialMariage <- data %>% filter(spouse != name) %>% select(name, spouse)
How can I remove these mirrored name/spouse records from the data frame?
P.S.: I would need it to be case-insensitive (Belemir == belemir) so that I don't have problems in the future.
Thanks!
You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
                 function(i) paste(sort(c(trimws(tolower(data$name[i])),
                                          trimws(tolower(data$spouse[i])))),
                                   collapse = " "))
irM <- data[!duplicated(sorted), ]
trimws() strips off any leading or trailing spaces before sorting and pasting, and tolower() converts everything to lower case.
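A vectorized variant of the same idea, assuming the same data frame: pmin() and pmax() return the element-wise alphabetical minimum and maximum, which avoids the row-wise sapply() loop.
lo_name <- trimws(tolower(data$name))
lo_spouse <- trimws(tolower(data$spouse))
key <- paste(pmin(lo_name, lo_spouse), pmax(lo_name, lo_spouse))
irM <- data[!duplicated(key), ]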
My attempt with tidyverse:
library(tidyverse)
dat %>%
  mutate(id = 1:n()) %>%           # add id to label the pairs
  gather('key', 'name', -id) %>%   # long format: key (name | spouse), name, id
  group_by(name) %>%               # group by name to find duplicates
  top_n(-1, wt = id) %>%           # if a name appears twice, keep the lower id
  spread(key, name) %>%            # spread back to the original format
  select(-id)                      # remove the helper id
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
  name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
  spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel"),
  stringsAsFactors = FALSE
)

R count and subtract events from a data frame

I am trying to calculate family sizes from a data frame which also contains two types of events: family members who died, and those who left the family. I would like to take these two variables into account in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001", "002", "003"), c(10, 8, 15)),
                 levels = c("001", "002", "003"),
                 labels = c("001", "002", "003"), ordered = TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left); DF
I could count N, the total number of members in each family, in a second data frame DF2 by simply using table():
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great!
Thank you in advance,
Deni
Logic: first group_by(family), then calculate two numbers: (i) the total number of observations in each group, and (ii) that total minus sum(dead) + sum(left).
In the dplyr package, n() gives the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>%
  group_by(family) %>%
  summarise(total = n(), current = n() - sum(dead, left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts DF from a data.frame to a data.table by reference
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl ~ family, transform(DF, dl = dead + left),
        FUN = function(x) c(total = length(x), current = length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family,
                    transform(DF, total = 1, current = dead + left)[c(1, 4:5)],
                    FUN = sum),
          current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
I finally found another approach that works fine (from another post), allowing everything to be computed from the original DF table. It uses ddply() from the plyr package:
library(plyr)
DF <- ddply(DF, .(family), transform, total = length(family))
DF <- ddply(DF, .(family), transform,
            actual = length(family) - sum(dead == "1") - sum(left == "1"))
DF
Thanks a lot to everyone who helped! Deni

Apply function to each column of dataframe

I have the following (very large) dataframe:
id epoch
1 0 1.141194e+12
2 1 1.142163e+12
3 2 1.142627e+12
4 2 1.142627e+12
5 3 1.142665e+12
6 3 1.142665e+12
7 4 1.142823e+12
8 5 1.143230e+12
9 6 1.143235e+12
10 6 1.143235e+12
For every unique id, I now want to get the difference between its maximum and minimum time (epoch timestamp). There are ids with many more occurrences than in the example above, in case that is relevant. I haven't worked much with R yet and tried the following:
get_duration = function(id) {
  maxTime = max(df$epoch[which(df$id == id)])
  minTime = min(df$epoch[which(df$id == id)])
  return((maxTime - minTime) / 1000)
}
unique = data.frame(as.numeric(unique(df$id)))
differences = apply(unique, 1, get_duration)
It works, but is incredibly slow. What would be a faster approach?
A couple of approaches. In base R:
tapply(df$epoch, df$id, function(x) (max(x) - min(x)) / 1000)
With data.table:
require(data.table)
setDT(df)
df[, list(d = (max(epoch) - min(epoch)) / 1000), by = id]
This can be done easily in dplyr:
require(dplyr)
df %>% group_by(id) %>% summarize(diff = (max(epoch) - min(epoch)) / 1000)
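With dplyr 1.1 or later, the grouping can also be supplied inline through the .by argument; a sketch of the same summary:
df %>% summarize(diff = (max(epoch) - min(epoch)) / 1000, .by = id)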
Or, inside get_duration(), run the filter by id just once and reuse the result:
subset = df$epoch[which(df$id == id)]
maxTime = max(subset)
minTime = min(subset)

Count based on multiple conditions from other data.frame

I am migrating analysis from Excel to R, and would like some input on how best to perform something similar to Excel's COUNTIFS in R.
I have two data.frames, statedf and memberdf.
statedf = data.frame(state = c('MD','MD','MD','NY','NY','NY'), week = 5:7)
memberdf = data.frame(memID = 1:15, state = c('MD','MD','NY','NY','MD'),
                      finalweek = c(3,3,5,3,3,5,3,5,3,5,6,5,2,3,5),
                      orders = c(1,2,3))
This data is for a subscription-based business. I would like to know the number of members who newly lapsed for each week/state combo in statedf, where newly lapsed is defined by statedf$week - 1 == memberdf$finalweek. Further, I would like separate counts for each orders value (1, 2, 3).
The desired output would look like:
out <- data.frame(state = c('MD','MD','MD','NY','NY','NY'), week = 5:7,
                  oneorder = c(0,1,0,0,0,0),
                  twoorder = c(0,0,1,0,1,0),
                  threeorder = c(0,3,0,0,1,0))
I asked (and got a great response for) a simpler version of this question yesterday; the answers revolved around creating a new data.frame based on memberdf. However, I need to append the data to statedf, because statedf has state/week combos that don't exist in memberdf, and vice versa. If this were in Excel, I'd use COUNTIFS, but I am struggling for a solution in R.
Thanks.
Here is a solution with the dplyr and tidyr packages:
library(tidyr); library(dplyr)
counts <- memberdf %>%
  mutate(lapsedweek = finalweek + 1) %>%
  group_by(state, lapsedweek, orders) %>%
  tally()
counts <- counts %>% spread(orders, n, fill = 0)
out <- left_join(statedf, counts, by = c("state", "week" = "lapsedweek"))
out[is.na(out)] <- 0 # convert NAs to 0s
names(out)[3:5] <- paste0("order", names(out)[3:5]) # rename columns
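In tidyr 1.1+, spread() is superseded by pivot_wider(); the reshaping step above could be written as, for example:
counts <- counts %>% pivot_wider(names_from = orders, values_from = n, values_fill = 0)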
We could create a new variable ('week1') in the 'statedf' dataset, merge 'memberdf' with 'statedf', and then reshape from 'long' to 'wide' format with dcast(). I changed the 'orders' column to match the column names in 'out'.
statedf$week1 <- statedf$week - 1
df1 <- merge(memberdf[-1], statedf, by.x = c('state', 'finalweek'),
             by.y = c('state', 'week1'), all.y = TRUE)
lvls <- paste0(c('one', 'two', 'three'), 'order')
df1$orders <- factor(lvls[df1$orders], levels = lvls)
library(reshape2)
out1 <- dcast(df1, state + week ~ orders, value.var = 'orders', length)[-6]
out1
# state week oneorder twoorder threeorder
#1 MD 5 0 0 0
#2 MD 6 1 0 3
#3 MD 7 0 1 0
#4 NY 5 0 0 0
#5 NY 6 0 1 1
#6 NY 7 0 0 0
all.equal(out, out1)
#[1] TRUE
