I have been having trouble with this problem for a while. Here's some sample data I am working with:
dt <- data.frame(purchase_freq = c('1','2','3','4', '5'), count = c('6','2','5','9','11'))
and I want the result to look similar to this:
dt <- data.frame(purchase_freq = c('1','2','3','4','5'), count = c('6','2','5','9','11'), cumulative_index = c('33','27','25','20','11'))
Thanks for the help!
Edit: Sorry, this was not clear enough. Basically cumulative_index[1] = sum(count[1:5]), cumulative_index[2] = sum(count[2:5]), cumulative_index[3] = sum(count[3:5]), and so forth. I know this might be simple enough, but I cannot really solve this one. Appreciate all the help.
You can subtract the lagged cumulative sum of c2 from the total sum of c2:
transform(dt, c3 = sum(c2) - c(0, cumsum(c2[-nrow(dt)])))
# c1 c2 c3
#1 1 6 33
#2 2 2 27
#3 3 5 25
#4 4 9 20
#5 5 11 11
This can be written with dplyr and data.table as well:
library(dplyr)
dt %>% mutate(c3 = sum(c2) - lag(cumsum(c2), default = 0))
library(data.table)
setDT(dt)[, c3 := sum(c2) - shift(cumsum(c2), fill = 0)]
data (the answers use numeric columns c1 and c2, since the question's character-valued counts would need conversion before summing):
dt <- data.frame(c1 = c(1,2,3,4,5), c2 = c(6,2,5,9,11))
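For reference, the same reversed cumulative sum can be computed more directly by cumulating from the end and flipping back (my own base R variant, not part of the original answer):
# reverse, take the cumulative sum, then reverse back to the original order
transform(dt, c3 = rev(cumsum(rev(c2))))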
I have the following example data.
data_1 <- data.frame(ID = c('a','b','c','d','e'),
                     value = c(2, 4, 9, 5, 3))
data_2 <- data.frame(ID = c('a','c','d','b','e','a','e','d','c'),
                     var = c(2, 6, 2, 4, 6, 8, 6, 4, 5))
I want to calculate a new column in data_2 such that, for matching IDs across the two datasets, value and var are multiplied.
Something like: where data_1$ID == data_2$ID, compute data_1$value * data_2$var. So newVar would be (4, 54, 10, 16, 18, 16, 18, 20, 45).
Join the two dataframes and multiply value and var (note that merge sorts the result by ID):
transform(merge(data_1, data_2, by = 'ID'), result = value * var)
You can also use match, which keeps data_2 in its original row order:
transform(data_2, result = var * data_1$value[match(ID, data_1$ID)])
# ID var result
#1 a 2 4
#2 c 6 54
#3 d 2 10
#4 b 4 16
#5 e 6 18
#6 a 8 16
#7 e 6 18
#8 d 4 20
#9 c 5 45
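Here match(ID, data_1$ID) returns, for each row of data_2, the position of that ID in data_1, which is what makes the lookup above work. A quick illustration (my addition):
match(data_2$ID, data_1$ID)
#[1] 1 3 4 2 5 1 5 4 3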
Using dplyr:
library(dplyr)
inner_join(data_1, data_2, by = 'ID') %>% mutate(result = value * var)
Using data.table, with an update join that adds result to data_2 (the i. prefix refers to data_1's column):
library(data.table)
setDT(data_2)[data_1, result := var * i.value, on = .(ID)]
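As a quick sanity check (my addition, after running the update join above), the result column on data_2 reproduces the expected newVar from the question:
all.equal(data_2$result, c(4, 54, 10, 16, 18, 16, 18, 20, 45))
#[1] TRUE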
I have a large database (90,000 x 1,500) organized by child observations, which also include the mother's info. I want to restructure the database according to the mothers' data.
The problem is that each child appears only once in the database, while a mother may appear up to 10 times.
In addition, I want the number of rows to equal the number of different mothers (approx. 40,000), with the data for each of her children (between 0 and 10 of them) spread across columns.
For example, the DB I have and the DB I want to create:
You could use reshape:
library(data.table)
df = data.frame(
  c = c('c1', 'c2', 'c3', 'c4', 'c5'),
  id_num = seq(1, 5),
  age = c(12, 15, 5, 8, 19),
  mom = c(1, 3, 1, 2, 3)
)
df
c id_num age mom
1 c1 1 12 1
2 c2 2 15 3
3 c3 3 5 1
4 c4 4 8 2
5 c5 5 19 3
df = setDT(df)[order(mom)]
df[, id_child := seq(.N), mom]
reshape(df, idvar = "mom", timevar = "id_child", direction = "wide")
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1: 1 c1 1 12 c3 3 5
2: 2 c4 4 8 <NA> NA NA
3: 3 c2 2 15 c5 5 19
Here is a solution similar to @Metariat's, but in base R, where ave() is used:
df$seq <- with(df, ave(id_num, mom, FUN = seq_along))
dfout <- reshape(df, idvar = "mom", timevar = "seq", direction = "wide")
such that
> dfout
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1 1 c1 1 12 c3 3 5
2 3 c2 2 15 c5 5 19
4 2 c4 4 8 <NA> NA NA
EDIT:
If you have a very big data frame, you can try a divide-and-conquer approach to see if it works:
library(plyr)
dfs <- split(df,df$mom)
lst <- lapply(dfs, function(x) {
  x <- within(x, seqnum <- ave(id_num, mom, FUN = seq_along))
  reshape(x, idvar = "mom", timevar = "seqnum", direction = "wide")
})
dfout <- rbind.fill(lst)
You can do this with dplyr (the group_by(), arrange(), and filter() verbs below come from dplyr, not tidyr):
data <- group_by(data, mom)
Each mom's rows are then grouped together, and you can sort the database within groups as follows:
data <- arrange(data, id_num, .by_group = TRUE)
To filter children between 0 and 10:
filter(data, age <= 10)
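For completeness, the wide reshape itself can also be done with tidyr's pivot_wider() (a sketch of my own, assuming tidyr >= 1.0.0):
library(dplyr)
library(tidyr)
df %>%
  group_by(mom) %>%
  mutate(id_child = row_number()) %>%  # index each mom's children
  pivot_wider(names_from = id_child, values_from = c(c, id_num, age))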
I need to find out how many factor levels reach given values of a continuous variable.
The code below produces the desired result for the example data, but it is rather an awkward workaround.
My real data frame is much larger, and the real plot should show more values (or a continuous range) on the x-axis, so I would appreciate a more scalable approach.
set.seed(5)
df <- data.frame(ID = factor(c("a","a","b","c","d","e","e")), values = runif(7, 0, 6))
seq <- 1:5
length.unique <- function(x) length(unique(x))
sub1 <- df[which(df$values >= 1), ]
sub2 <- df[which(df$values >= 2), ]
sub3 <- df[which(df$values >= 3), ]
sub4 <- df[which(df$values >= 4), ]
sub5 <- df[which(df$values >= 5), ]
N_IDs <- c(length.unique(sub1$ID), length.unique(sub2$ID), length.unique(sub3$ID),
           length.unique(sub4$ID), length.unique(sub5$ID))
plot(N_IDs ~ seq, type="b")
Using tidyverse, you can save some time by first calculating the max value for each ID,
library(tidyverse)
idmax <- df %>% group_by(ID) %>% summarize(max=max(values)) %>% pull(max)
Then, for each cut point, return the count of IDs that pass:
map_df(1:5, ~ data.frame(cut = .x, count = sum(idmax >= .x)))
# cut count
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 3
# 5 5 1
Using non-equi joins:
library(data.table)
setDT(df)
df[.(seq = 1:5), on = .(values >= seq), allow.cartesian = TRUE, .(N_IDs = uniqueN(ID)), by = .EACHI]
# values N_IDs
#1: 1 4
#2: 2 3
#3: 3 3
#4: 4 3
#5: 5 1
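For completeness, the question's five explicit subsets can be collapsed into a single sapply() over the cut points (a base R sketch of the same logic, my addition):
cuts <- 1:5
N_IDs <- sapply(cuts, function(s) length(unique(df$ID[df$values >= s])))
plot(N_IDs ~ cuts, type = "b")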
Apologies if this has been answered; I've gone through numerous examples today but can't find any that match what I am trying to do.
I have a data set on which I need to calculate a 3-point moving average. I've generated some dummy data below:
set.seed(1234)
df <- data.frame(Week = rep(1:5, 3),
                 Section = c(rep("a", 5), rep("b", 5), rep("c", 5)),
                 Qty = runif(15, min = 100, max = 500),
                 To = runif(15, min = 40, max = 80))
I want to calculate the MA for each group based on the 'Section' column, for both the 'Qty' and 'To' columns. Ideally the output would be a data.table. The moving average should start at week 3, i.e. the first value is the average of weeks 1:3.
I am trying to master the data.table package, so a solution using that would be great, but otherwise any solution will be much appreciated.
Just for reference, my actual data set has approx. 70 sections and c. 1M rows in total. I've found data.table to be extremely fast at crunching these kinds of volumes so far.
We could use rollmean from the zoo package, in combination with data.table.
library(data.table)
library(zoo)
setDT(df)[, c("Qty.mean","To.mean") := lapply(.SD, rollmean, k = 3, fill = NA, align = "right"),
.SDcols = c("Qty","To"), by = Section]
> df
# Week Section Qty To Qty.mean To.mean
#1: 1 a 145.4814 73.49183 NA NA
#2: 2 a 348.9198 51.44893 NA NA
#3: 3 a 343.7099 50.67283 279.3703 58.53786
#4: 4 a 349.3518 47.46891 347.3271 49.86356
#5: 5 a 444.3662 49.28904 379.1426 49.14359
#6: 1 b 356.1242 52.66450 NA NA
#7: 2 b 103.7983 52.10773 NA NA
#8: 3 b 193.0202 46.36184 217.6476 50.37802
#9: 4 b 366.4335 41.59984 221.0840 46.68980
#10: 5 b 305.7005 48.75198 288.3847 45.57122
#11: 1 c 377.4365 72.42394 NA NA
#12: 2 c 317.9899 61.02790 NA NA
#13: 3 c 213.0934 76.58633 302.8400 70.01272
#14: 4 c 469.3734 73.25380 333.4856 70.28934
#15: 5 c 216.9263 41.83081 299.7977 63.89031
A solution using dplyr (mutate_each()/funs() from the original answer are deprecated in current dplyr; across() is the modern equivalent):
library(dplyr); library(zoo)
myfun <- function(x) rollmean(x, k = 3, fill = NA, align = "right")
df %>% group_by(Section) %>% mutate(across(c(Qty, To), myfun))
#### Week Section Qty To
#### (int) (fctr) (dbl) (dbl)
#### 1 1 a NA NA
#### 2 2 a NA NA
#### 3 3 a 279.3703 58.53786
#### 4 4 a 347.3271 49.86356
There is now a faster approach using the frollmean() function introduced in data.table 1.12.0:
setDT(df)[, c("Qty.mean","To.mean") := frollmean(.SD, 3),
.SDcols = c("Qty","To"),
by = Section]
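As a quick check (my addition), frollmean()'s defaults (align = "right", fill = NA) match the rollmean() call used in the first answer, so both approaches should produce identical columns:
library(zoo)
setDT(df)[, all.equal(frollmean(Qty, 3),
                      rollmean(Qty, k = 3, fill = NA, align = "right")),
          by = Section]
# returns TRUE for every Section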
I have a big data set (roughly 10,000 rows) and want to create a function that counts the number of complete cases (not NAs) per group. I tried various functions (aggregate, table, sum(complete.cases), group_by, etc.), but somehow I am missing one, probably little, trick. Thanks for any help!
A little sample data set to explain the result I need:
x <- data.frame(group = c(1:4),
age = c(4:1, c(11, NA,13, NA)),
speed = c(12, NA,15,NA))
print(x)
# group age speed
#1 1 4 12
#2 2 3 NA
#3 3 2 15
#4 4 1 NA
#5 1 11 12
#6 2 NA NA
#7 3 13 15
#8 4 NA NA
One function I wrote reads as follows:
CountPerGroup <- function(group) {
data.set <- subset(x,group %in% group)
vect <- vector()
for (i in 1:length(group)) {
vect[i] <- sum(complete.cases(data.set))
}
output <- data.frame(cbind(group,count=vect))
return(output)
}
The result of
CountPerGroup(2:1)
is
group count
1 2 4
2 1 4
Unfortunately, this is wrong. Instead the outcome should look like
group count
1 2 1
2 1 4
What am I missing? How can I tell R to count complete cases per group?
Thank you very much for any help on this!
Something like this should do the trick if you wish to maintain your functionality:
x <- data.frame(group = c(1:4),
age = c(4:1, c(11, NA,13, NA)),
speed = c(12, NA,15,NA))
CountPerGroup <- function(x, groups) {
data.set <- subset(x, group %in% groups)
ans <- sapply(split(data.set, data.set$group),
function(y) sum(complete.cases(y)))
return(data.frame(group = names(ans), count = unname(ans)))
}
CountPerGroup(x, 1:2)
# group count
#1 1 2
#2 2 0
This is correct as far as I can count, but it does not agree with your suggested outcome.
EDIT
It seems that you want the number of non-NA values instead, in the requested order. Use this function instead:
CountPerGroup2 <- function(x, groups) {
data.set <- subset(x, group %in% groups)
ans <- sapply(split(data.set, data.set$group),
function(y) sum(!is.na(y[, !grepl("group", names(y))])))[groups]
return(data.frame(group = names(ans), count = unname(ans)))
}
CountPerGroup2(x, 2:1)
# group count
#1 2 1
#2 1 4
If you are just looking for a way to get the full count of non-NA values per group, you could use something like:
library(plyr)
x <- data.frame(group = c(1:4),
age = c(4:1, c(11, NA,13, NA)),
speed = c(12, NA,15,NA))
counts <- ddply(x, "group", summarize, count=sum(!is.na(c(age, speed))))
## group count
## 1 1 4
## 2 2 1
## 3 3 4
## 4 4 1
You do miss out on having a function that lets you query a subset of the groups, but you get a one-line way to calculate the full solution.
Here is a way with data.table:
library(data.table)
library(functional)
countPerGroup = function(x, vec)
{
  dt = data.table(x)
  d1 = setkey(dt, group)[group %in% vec]
  # Compose(Negate(is.na), sum) computes sum(!is.na(col)) for each column
  d2 = d1[, lapply(.SD, Compose(Negate(is.na), sum)), by = group]
  transform(d2, count = age + speed, speed = NULL, age = NULL)
}
countPerGroup(x, 1:2)
# group count
#1: 1 4
#2: 2 1
countPerGroup(x, c(1,2))
# group count
#1: 1 4
#2: 2 1
If your data.table has many rows, this is particularly efficient!
I just had the same problem and found an easier solution:
library(data.table)
x <- data.table(group = c(1:4),
age = c(4:1, c(11, NA,13, NA)),
speed = c(12, NA,15,NA))
x[, sum(complete.cases(.SD)), by = group]
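For reference, a dplyr sketch of the same per-group count of complete cases (my addition):
library(dplyr)
x %>%
  group_by(group) %>%
  summarize(count = sum(complete.cases(age, speed)))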