I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
Related
I'd like to make a frequency count individually for multiple columns with same possible values. The idea is to keep all columns from original data table, just adding a new one for levels and aggregating.
Here is an example of input data:
foo <- data.table(a = c(1,3,2,3,3), b = c(2,3,3,1,1), c = c(3,1,2,3,2))
# a b c
#1: 1 2 3
#2: 3 3 1
#3: 2 3 2
#4: 3 1 3
#5: 3 1 2
And desired output:
data.table(levels = 1:3, a = c(1,1,3), b = c(2,1,2), c = c(1,2,2))
# levels a b c
#1: 1 1 2 1
#2: 2 1 1 2
#3: 3 3 2 2
Thanks for helping !
We may use
library(data.table)
dcast(melt(foo)[, .N, .(variable, levels = value)],
levels ~ variable, value.var = 'N')
-output
Key: <levels>
levels a b c
<num> <int> <int> <int>
1: 1 1 2 1
2: 2 1 1 2
3: 3 3 2 2
Or using base R
table(stack(foo))
ind
values a b c
1 1 2 1
2 1 1 2
3 3 2 2
You could also use recast from reshape2:
reshape2::recast(foo, value~variable)
# No id variables; using all as measure variables
# Aggregation function missing: defaulting to length
value a b c
1 1 1 2 1
2 2 1 1 2
3 3 3 2 2
or even
reshape2::recast(foo, value~variable, length)
Here is an option using purrr and dplyr from the tidyverse:
library(purrr)
library(dplyr)
foo %>%
imap(~ as.data.frame(table(.x, dnn = "levels"), responseName = .y)) %>%
reduce(left_join, by = "levels")
Alternatively, you could use the pivot functions from tidyr:
library(dplyr)
library(tidyr)
foo %>%
pivot_longer(everything(),
values_to = "levels") %>%
count(name, levels) %>%
pivot_wider(id_cols = levels,
names_from = name,
values_from = n)
foo |>
melt() |>
dcast(value ~ variable, fun.aggregate = length)
# value a b c
# 1: 1 1 2 1
# 2: 2 1 1 2
# 3: 3 3 2 2
For some reason, the i term in the for loop cannot be used as the grouping name. I have 40 elements in the for loop. I am showing just 2 here as an example.
data = data.table(id = c(1,1,1,1,1), a = c(1,1,2,3,NA), b = c(1,2,2,NA,3))
> data
id a b
1: 1 1 1
2: 1 1 2
3: 1 2 2
4: 1 3 NA
5: 1 NA 3
categories = data.table(CATEGORY = c(1,2,3,NA))
> categories
CATEGORY
1: 1
2: 2
3: 3
4: NA
What I have done:
for (i in colnames(data)[2:3]){
dt = data[, .N, i][order(i)]
setnames(dt, "N", i)
categories = cbind(categories, dt[,2])
}
> categories
CATEGORY a b
1: 1 2 1
2: 2 2 1
3: 3 2 1
4: NA 2 1
I have also tried the dplyr piping instead of the data.table .N and it did not work:
data %>% count(i)
What I need:
> categories
CATEGORY a b
1: 1 2 1
2: 2 1 2
3: 3 1 1
4: NA 1 1
You could reshape the data instead which would make this easier to calculate.
In data.table :
library(data.table)
long_data <- melt(data, 'id')
dcast(long_data[, .N, .(variable, value)], value~variable, value.var = 'N')
# value a b
#1: NA 1 1
#2: 1 2 1
#3: 2 1 2
#4: 3 1 1
Or in tidyverse :
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = -id) %>%
count(name, value) %>%
pivot_wider(names_from = name, values_from = n)
I have a larger data frame that has multiple columns and thousands of rows. I want to replace the value of every lead row by subtracting the previous row value from the lead row for every five rows of the data frame. For example, the first value should retain its value, the second row should be: second row - first row. Similarly, the sixth row should retain its value, however, the seventh row would be seventh row - sixth row. Here is an example data frame
DF = data.frame(A= c(1:11), B = c(11:21))
The outputput should be like below
> Output
A B
1 1 11
2 1 1
3 1 1
4 1 1
5 1 1
6 6 16
7 1 1
8 1 1
9 1 1
10 1 1
11 11 21
One option would be to create a grouping variable and then do the transformation with diff which does the difference of adjacent elements of the columns selected in mutate_all (if only a subset of columns are needed either use mutate_if or mutate_at)
library(dplyr) #v_0.8.3
DF %>%
group_by(grp = as.integer(gl(n(), 5, n()))) %>%
mutate_all(~c(first(.), diff(.))) %>%
ungroup %>%
select(-grp)
# A tibble: 11 x 2
# A B
# <int> <int>
# 1 1 11
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 6 16
# 7 1 1
# 8 1 1
# 9 1 1
#10 1 1
#11 11 21
The above also gives a warning when we use mutate_all after group_by (previously it used to work - in the new versions, the correct syntax would be to use mutate_at
DF %>%
group_by(grp = as.integer(gl(n(), 5, n()))) %>%
mutate_at(vars(-group_cols()), ~c(first(.), diff(.))) %>%
ungroup %>%
select(-grp)
f = function(d, n = 5) ave(d, ceiling(seq_along(d)/n), FUN = function(x) c(x[1], diff(x)))
data.frame(lapply(DF, f))
# A B
#1 1 11
#2 1 1
#3 1 1
#4 1 1
#5 1 1
#6 6 16
#7 1 1
#8 1 1
#9 1 1
#10 1 1
#11 11 21
Another option would be to create another data.frame with shifted rows and subtract directly
ind = ave(1:nrow(DF), ceiling(1:nrow(DF)/5), FUN = function(x) c(x[1], x[-length(x)]))
DF2 = DF[ind,] * replace(rep(1, nrow(DF)), diff(ind) == 0, 0)
DF - DF2
You can %/% the row number minus 1 by 5 to get the groups, then use diff to get the difference from the previous x (or 0 if there is no previous x) from x for all columns x for each group.
library(data.table)
setDT(DF)
DF[, lapply(.SD, function(x) diff(c(0, x)))
, (1:nrow(DF) - 1) %/% 5][, -1]
# A B
# 1: 1 11
# 2: 1 1
# 3: 1 1
# 4: 1 1
# 5: 1 1
# 6: 6 16
# 7: 1 1
# 8: 1 1
# 9: 1 1
# 10: 1 1
# 11: 11 21
Or, as mentioned by #akrun, you could avoid lapply by replacing
lapply(.SD, function(x) diff(c(0, x)))
with
.SD - shift(.SD, fill = 0)
Another less serious option:
x <- DF[, !(.I - 1) %% 5]
DF*(1 + x) - DF[DF[, .I - !x]]
# A B
# 1: 1 11
# 2: 1 1
# 3: 1 1
# 4: 1 1
# 5: 1 1
# 6: 6 16
# 7: 1 1
# 8: 1 1
# 9: 1 1
# 10: 1 1
# 11: 11 21
I want to do something like this:
How to make a unique in R by column A and keep the row with maximum value in column B
Except my data.table has one key column, and multiple value columns. So say I have the following:
a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1
If the key is column a, I want for each unique a to return the row with the maximum b, and if there is more than one unique max b, get the one with the max c and so on for multiple columns. So the result should be:
a b c
1: 1 2 2
2: 2 3 3
3: 3 2 1
I'd also like this to be done for an arbitrary number of columns. So if my data.table had 20 columns, I'd want the max function to be applied in order from left to right.
Here is a suggested data.table solution. You might want to consider using data.table::frankv as follows:
DT[, .SD[frankv(.SD, ties.method="first")[.N],], by=a]
frankv returns the order. Then [.N] will take the largest rank. Then .SD[ subset to that particular row.
Please let me know if it fails for your larger dataset.
to make this work for any number of columns, a possible dplyr solution would be to use arrange_all
df <- data.frame(a = c(1,1,1,2,2,2,3,3), b = c(1,2,2,1,2,3,1,2),
c = c(1,1,2,1,5,3,4,1))
df %>% group_by(a) %>% arrange_all() %>% filter(row_number() == n())
# A tibble: 3 x 3
# Groups: a [3]
# a b c
# 1 1 2 2
# 2 2 3 3
# 3 3 2 1
The generic solution can be achieved for arbitrary number of column using mutate_at. In the below example c("a","b","c") are arbitrary columns.
library(dplyr)
df %>% arrange_at(.vars = vars(c("a","b","c"))) %>%
mutate(changed = ifelse(a != lead(a), TRUE, FALSE)) %>%
filter(is.na(changed) | changed ) %>%
select(-changed)
a b c
1 1 2 2
2 2 3 3
3 3 2 1
Another option could be using max and dplyr as below. The approach is to first group_by on a and then filter for max value of b. The again group_by on both a and b and filter for rows with max value of c.
library(dplyr)
df %>% group_by(a) %>%
filter(b == max(b)) %>%
group_by(a, b) %>%
filter(c == max(c))
# Groups: a, b [3]
# a b c
# <int> <int> <int>
#1 1 2 2
#2 2 3 3
#3 3 2 1
Data
df <- read.table(text = "a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1", header = TRUE, stringsAsFactors = FALSE)
dat <- data.frame(a = c(1,1,1,2,2,2,3,3),
b = c(1,2,2,1,2,3,1,2),
c = c(1,1,2,1,5,3,4,1))
library(sqldf)
sqldf("with d as (select * from 'dat' group by a order by b, c desc) select * from d order by a")
a b c
1 1 2 2
2 2 3 3
3 3 2 1
I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
grp = sort(rep(1:5,4)),
var1 = rep(c('A','B'),10)
)
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes no of distinct var1 values so far i.e. until that point in time for each group grp . This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired from #akrun's answer -
Ths logic is basically to set 1st occurrence of each unique values of var1 to 1 and rest to 0 for each group grp and then apply cumsum on it -
df = df %>%
arrange(time) %>%
group_by(grp,var1) %>%
mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
group_by(grp) %>%
mutate(var2 = cumsum(var_temp)) %>%
select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
Try:
Update
With your new dataset, an approach in base R
df$var2 <- unlist(lapply(split(df, df$grp),
function(x) {x$var2 <-0
indx <- match(unique(x$var1), x$var1)
x$var2[indx] <- 1
cumsum(x$var2) }))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
# Given a vector x, returns a corresponding vector y
# where the ith element of y gives the number of unique
# elements observed up to and including index i
# if na.include = TRUE (default) NA is counted as an
# additional unique element, otherwise it's essentially ignored
temp <- data.table(x, idx = seq_along(x))
firsts <- temp[temp[, .I[1L], by = x]$V1]
if(na.include == FALSE) firsts <- firsts[!is.na(x)]
y <- rep(0, times = length(x))
y[firsts$idx] <- 1
y <- cumsum(y)
return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))