I have data, stored as a data.table dt or a data.frame df, with multiple observations per id-month combination. I want to store the row number within each id-month group in a variable; let's call it row.
I know how to do this in dplyr but want to learn how to do it in (pure) data.table. I assume it is a trivially easy operation, but I can't seem to find a solution that works.
Reprex:
library(dplyr)
library(data.table)
df <- tibble(id = c(1, 1, 1, 2, 2, 2), month = c(1, 1, 2, 1, 1, 2))  # data_frame() is deprecated in favour of tibble()
dt <- data.table(df)
My dplyr solution gives the expected output:
df %>%
group_by(id, month) %>%
mutate(row = row_number(id))
# A tibble: 6 x 3
# Groups: id, month [4]
id month row
<dbl> <dbl> <int>
1 1 1 1
2 1 1 2
3 1 2 1
4 2 1 1
5 2 1 2
6 2 2 1
Doing similar operations on a data.table yields something different:
dt[, row := row_number(id), by = c("id", "month")]
id month row
1: 1 1 1
2: 1 1 1
3: 1 2 1
4: 2 1 1
5: 2 1 1
6: 2 2 1
Or:
dt[, row := .I, by = c("id", "month")]
id month row
1: 1 1 1
2: 1 1 2
3: 1 2 3
4: 2 1 4
5: 2 1 5
6: 2 2 6
I think I see why this happens: inside j, columns named in by are only available as length-1 group values, so row_number(id) evaluates to 1 and gets recycled across the group, while .I indexes rows of the whole table rather than within each group. But I do not know how to get the expected result in pure data.table.
dt[, row := row.names(dt), by = c("id", "month")]
dt
id month row
1: 1 1 1
2: 1 1 2
3: 1 2 1
4: 2 1 1
5: 2 1 2
6: 2 2 1
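Note the row.names(dt) version above seems to work only by accident: the full-table row names happen to get truncated to each group's size (older data.table versions recycle with a warning). The idiomatic pure data.table ways to number rows within groups are seq_len(.N) or rowid():
dt[, row := seq_len(.N), by = .(id, month)]
# or, equivalently
dt[, row := rowid(id, month)]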
Related
How can I run a loop over multiple columns, changing cumulative values to true (per-bin) values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It is very simple, but you first have to take a copy of your data set, because the desired output values are differences between rows of the original data. We create the copy DF before the for loop so its values remain intact; otherwise each iteration would overwrite values of the data set that later iterations still need, and the final output would be incorrect:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that, after row i is changed, the changed row is used in calculating row i+1 rather than the original row i. To fix that, run the loop in reverse order, i.e. use nrow(df):2 in the for statement (a sketch follows). Alternatively, try one of these approaches, which do not use any loops and also have the advantage of not overwriting the input -- something that makes the code easier to debug.
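A minimal sketch of the reverse-order loop just described (same data as in the question):
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    # row - 1 has not been modified yet, so the original values are used
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}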
1) Base R Use ave to apply Diff by group, where Diff uses diff to do the actual differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
#  Time Value Bin Subject_ID
#1    1     6   1          1
#2    2     4   2          1
#3    4     8   3          1
#4    1     2   4          1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one-liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1
I have a larger data frame with multiple columns and thousands of rows. For every five rows, I want to replace each row's value by subtracting the previous row's value from it. That is, the first row of each block of five retains its value, and the second row becomes: second row - first row. Similarly, the sixth row (the start of the next block) retains its value, and the seventh row becomes: seventh row - sixth row. Here is an example data frame
DF = data.frame(A= c(1:11), B = c(11:21))
The output should be like below
> Output
A B
1 1 11
2 1 1
3 1 1
4 1 1
5 1 1
6 6 16
7 1 1
8 1 1
9 1 1
10 1 1
11 11 21
One option would be to create a grouping variable and then do the transformation with diff, which takes differences of adjacent elements, for the columns selected by mutate_all (if only a subset of columns is needed, use mutate_if or mutate_at instead):
library(dplyr) #v_0.8.3
DF %>%
group_by(grp = as.integer(gl(n(), 5, n()))) %>%
mutate_all(~c(first(.), diff(.))) %>%
ungroup %>%
select(-grp)
# A tibble: 11 x 2
# A B
# <int> <int>
# 1 1 11
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 6 16
# 7 1 1
# 8 1 1
# 9 1 1
#10 1 1
#11 11 21
The above also gives a warning when mutate_all is used after group_by (it used to work in earlier versions); in newer versions the correct syntax is mutate_at:
DF %>%
group_by(grp = as.integer(gl(n(), 5, n()))) %>%
mutate_at(vars(-group_cols()), ~c(first(.), diff(.))) %>%
ungroup %>%
select(-grp)
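For what it's worth, in current dplyr (1.0+) the mutate_all/mutate_at family is superseded by across; a sketch of the same transformation:
library(dplyr)
DF %>%
  group_by(grp = (row_number() - 1) %/% 5) %>%   # same block grouping as gl() above
  mutate(across(everything(), ~ c(first(.x), diff(.x)))) %>%
  ungroup() %>%
  select(-grp)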
A base R alternative applies the same first-value-plus-diff idea with ave, over an index marking each block of n rows:
f = function(d, n = 5) ave(d, ceiling(seq_along(d)/n), FUN = function(x) c(x[1], diff(x)))
data.frame(lapply(DF, f))
# A B
#1 1 11
#2 1 1
#3 1 1
#4 1 1
#5 1 1
#6 6 16
#7 1 1
#8 1 1
#9 1 1
#10 1 1
#11 11 21
Another option would be to create a second data.frame of shifted rows and subtract directly. Here ind repeats the first index of each block and lags the rest, and the replace() multiplier zeroes out the first row of each block (for this 11-row example) so those rows keep their original values:
ind = ave(1:nrow(DF), ceiling(1:nrow(DF)/5), FUN = function(x) c(x[1], x[-length(x)]))
DF2 = DF[ind,] * replace(rep(1, nrow(DF)), diff(ind) == 0, 0)
DF - DF2
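The replace() call above leans on R recycling the length-(n-1) logical diff(ind) == 0 against the n-row multiplier, which happens to zero row 11 correctly here but is fragile for other row counts; a more transparent way to build the same multiplier (a sketch):
first_of_block <- (seq_len(nrow(DF)) - 1) %% 5 == 0   # TRUE at rows 1, 6, 11, ...
DF2 <- DF[ind, ] * !first_of_block                    # zero out the first row of each block
DF - DF2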
You can %/% the row number minus 1 by 5 to get the block groups, then use diff to get, for every column x within each group, the difference of each value from the previous value of x (or from 0 if there is no previous value).
library(data.table)
setDT(DF)
DF[, lapply(.SD, function(x) diff(c(0, x)))
, (1:nrow(DF) - 1) %/% 5][, -1]
# A B
# 1: 1 11
# 2: 1 1
# 3: 1 1
# 4: 1 1
# 5: 1 1
# 6: 6 16
# 7: 1 1
# 8: 1 1
# 9: 1 1
# 10: 1 1
# 11: 11 21
Or, as mentioned by #akrun, you could avoid lapply by replacing
lapply(.SD, function(x) diff(c(0, x)))
with
.SD - shift(.SD, fill = 0)
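Assembled, that variant reads (a sketch, equivalent to the lapply version above):
DF[, .SD - shift(.SD, fill = 0), (1:nrow(DF) - 1) %/% 5][, -1]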
Another, less serious option (x flags the first row of each block of five; those rows are doubled and then have themselves subtracted, while the remaining rows subtract the previous row):
x <- DF[, !(.I - 1) %% 5]
DF*(1 + x) - DF[DF[, .I - !x]]
# A B
# 1: 1 11
# 2: 1 1
# 3: 1 1
# 4: 1 1
# 5: 1 1
# 6: 6 16
# 7: 1 1
# 8: 1 1
# 9: 1 1
# 10: 1 1
# 11: 11 21
I have data with about 1000 groups; each group is ordered from 1-100 (the values can be any numbers up to 100).
As I was looking through the data, I found that some groups had bad orders, i.e., the order would run up towards 100 and then suddenly a 24 would show up.
How can I delete all of this erroneous data?
I would like to find all rows that don't follow the increasing order within their group and just delete them.
Any help would be great!
lag retrieves the previous value, so diff = order - lag(order) is the difference between the current and previous values; the filter then keeps only non-negative differences, i.e. rows whose value is at least the previous one. The order == min(order) condition keeps the first row of each group, where lag yields NA. I keep the helper column diff to check, but you can drop it using %>% select(-diff)
library(dplyr)
df1 %>% group_by(gruop) %>% mutate(diff = order-lag(order)) %>%
filter(diff >= 0 | order==min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave to drop, within each group, the rows whose difference from the previous row is not 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one, then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9
Let's say I have the following simple data.frame:
ID value
1 1 3
2 2 4
3 1 5
4 3 3
My desired output is below: a new value is added to cumsum, but for an ID that has appeared before, cumsum is instead updated with that ID's latest value.
ID value cumsum
1 1 3 3
2 2 4 7
3 1 5 9
4 3 3 12
In row 3, the new value updates the cumsum for ID 1 (7 - 3 + 5 = 9). Row 4 adds a new value to cumsum because ID 3 was not used before (4 + 5 + 3 = 12, the sum of the latest value of each ID).
This produces the desired outcome for your example: within each ID, value is replaced by its increment over that ID's previous value, so the overall cumsum always equals the sum of the latest value of each ID:
df<-read.table(header=T, text="ID value
1 1 3
2 2 4
3 1 5
4 3 3")
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value = value-lag(value, def = 0L)) %>%
ungroup %>% mutate(cumsum = cumsum(value))
# # A tibble: 4 x 3
# ID value cumsum
# <int> <int> <int>
# 1 1 3 3
# 2 2 4 7
# 3 1 2 9
# 4 3 3 12
I used data.table for the cumsum. Calculating the cumulative mean is a bit more tricky because the number of observations is not adjusted by just using cummean.
library(data.table)
library(zoo)  # na.locf() below comes from zoo
dt = data.table(id = c(1, 2, 1, 3), value = c(3, 4, 5, 3))
# per id, replace value by its increment over the previous occurrence
dt[, tmp := value-shift(value, n = 1L, type = "lag", fill = 0), by = c("id")]
#CUMSUM
dt[, cumsum := cumsum(tmp)]
#CUMMEAN WITH UPDATED N: skip counts how many rows so far were updates rather than new observations
dt[value != tmp, skip := 1:.N]
dt[, skip := na.locf(skip, na.rm = FALSE)]  # carry the count forward
dt[is.na(skip), skip := 0]
dt[, cummean := cumsum(tmp)/(seq_along(tmp)-skip)]
Output is (cummean is the running mean over the latest value of each id seen so far):
id value tmp cumsum skip cummean
1: 1 3 3 3 0 3.0
2: 2 4 4 7 0 3.5
3: 1 5 2 9 1 4.5
4: 3 3 3 12 1 4.0
Edit: Changed lag function to data.table's shift function.
I need to fill in the missing values of the $Year sequence within each level of the $Country factor. The $Count column can just be padded out with 0s.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count); for instance, see the sketch below. Similar functionality exists in the "tidyverse": tidyr's complete()/full_seq(), used in the first answer above.
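For example, with a hypothetical second column Pop in DT, only the defaults list changes (a sketch):
def = list(Count = 0L, Pop = 0L)           # Pop is hypothetical
fullDT[, names(def) := def ]
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]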
Another base R idea is to split on Country, use setdiff to find the values missing from seq(max(Year)), and rbind them onto each piece. do.call then rbinds the list back into a single data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
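Note the filled-in rows come back as NA rather than 0; if zeros are needed, one more step in the same style should do it (a sketch):
> res <- DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
> res[is.na(Count), Count := 0]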
Another dplyr and tidyr solution (note that do() and data_frame() are deprecated in current releases; complete(), used above, is the modern idiom):
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the returned named list, merges this onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1