R average over ages when some ages missing

I have a data.table with columns for age, food category, and kcal consumed. I'm trying to get the average kcal for each category, but for some ages there is no consumption at all in a given category. So I can't take a simple average, because the implied zeroes aren't in the data.table.
So for the example data:
library(data.table)
dtp2 <- data.table(age = c(4,4,4,5,6,18),
                   category = c("chips","vegetables","pizza","chips","pizza","beer"),
                   kcal = c(100,5,100,120,100,150))
just doing dtp2[, mean(kcal), by = category] gives the wrong answer, because only the 18-year-olds are consuming beer and the 4- to 17-year-olds aren't.
The actual data set covers 4- to 18-year-olds with many, many categories. I've tried populating the data.table with zeroes for the omitted ages using a nested for loop, which is very slow, and then taking the means as above.
Is there a sensible R way of taking the mean kcal where missing values are assumed to be zero, without nested for loops putting in the zeroes?

I take it you want to include missing or 0 kcal values in the calculation. Instead of taking the average, you could just sum by category and divide by the total n for each category.

The suggestion by Mr. Bugle is rather generic and doesn't show any code. Picking this up, the code of the OP needs to be modified as follows:
library(data.table)
dtp2[, sum(kcal) / uniqueN(dtp2$age), by = category]
which returns
category V1
1: chips 55.00
2: vegetables 1.25
3: pizza 50.00
4: beer 37.50
Note that uniqueN(dtp2$age) is used, not just uniqueN(age): grouped by category, uniqueN(age) counts only the ages actually present in each group, which would reproduce the plain mean rather than the zero-filled one.
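A quick side-by-side check of the naive and the zero-filled mean (a sketch):
dtp2[, .(naive = mean(kcal), zero_filled = sum(kcal) / uniqueN(dtp2$age)), by = category]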
However, there are situations where the missing rows need to be filled with zeroes explicitly, without nested for loops putting them in, as the OP has asked. One such situation arises when the data is reshaped from long to wide format for presentation:
reshape2::dcast(dtp2, age ~ category, fun.aggregate = mean, value.var = "kcal", margins = TRUE)
age beer chips pizza vegetables (all)
1 4 NaN 100 100 5 68.33333
2 5 NaN 120 NaN NaN 120.00000
3 6 NaN NaN 100 NaN 100.00000
4 18 150 NaN NaN NaN 150.00000
5 (all) 150 110 100 5 95.83333
Here, the margin means are computed only from the available data, which is not what the OP asked for. (Note that the parameter fill = 0 has no effect on the computation of the margins.)
So, the missing values need to be filled up before reshaping. In base R, expand.grid() can be used for this purpose, in data.table it's the cross join function CJ():
expanded <- dtp2[CJ(age = age, category = category, unique = TRUE),
                 on = .(age, category)][is.na(kcal), kcal := 0][]
expanded
age category kcal
1: 4 beer 0
2: 4 chips 100
3: 4 pizza 100
4: 4 vegetables 5
5: 5 beer 0
6: 5 chips 120
7: 5 pizza 0
8: 5 vegetables 0
9: 6 beer 0
10: 6 chips 0
11: 6 pizza 100
12: 6 vegetables 0
13: 18 beer 150
14: 18 chips 0
15: 18 pizza 0
16: 18 vegetables 0
Now, reshaping from long to wide returns the expected results:
reshape2::dcast(expanded, age ~ category, fun.aggregate = mean, value.var = "kcal", margins = TRUE)
age beer chips pizza vegetables (all)
1 4 0.0 100 100 5.00 51.2500
2 5 0.0 120 0 0.00 30.0000
3 6 0.0 0 100 0.00 25.0000
4 18 150.0 0 0 0.00 37.5000
5 (all) 37.5 55 50 1.25 35.9375
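For reference, the same zero-filling can be done in base R with expand.grid(), as mentioned above (a sketch; merge() plays the role of the cross join):
# build the full age x category grid, then left-merge and zero-fill
full_grid <- expand.grid(age = unique(dtp2$age),
                         category = unique(dtp2$category),
                         stringsAsFactors = FALSE)
expanded_base <- merge(full_grid, dtp2, by = c("age", "category"), all.x = TRUE)
expanded_base$kcal[is.na(expanded_base$kcal)] <- 0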

Related

Calculate Median for a group based on lag and lead values in R

I have a complex ask, so I would appreciate your patience.
For each data point, I would like to first capture the values from the lag and lead columns, then use them to locate the corresponding data points in the series, and compute the median of the captured values, for each group.
Group,Date,Month,Sales,lag,lead
Group1,42005,1,2503,0,2
Group1,42036,2,3734,0,2
Group1,42064,3,6631,2,3
Group1,42095,4,8606,0,0
Group1,42125,5,1889,0,2
Group1,42156,6,4819,1,2
Group1,42186,7,3294,1,0
Group1,42217,8,38999,2,0
Group1,42248,9,28372,1,0
Group1,42278,10,25396,4,1
Group1,42309,11,21093,1,0
Group2,42339,1,9263,0,3
Group2,42005,2,6660,1,3
Group2,42036,3,28595,2,2
Group2,42064,4,123,2,0
Group2,42095,5,11855,3,3
Group2,42125,6,15845,4,3
Group2,42156,7,32331,2,2
Group2,42186,8,3188,1,1
Group2,42217,9,38161,4,0
For example, if we look at Month 6 for Group1, the Sales value is 4819, and the lag and lead values are 1 and 2 respectively.
I would like to first capture the lag and lead values and then look them up in the series. For 4819, with lag = 1, I go one data point above 4819 (which is 1889); similarly, with lead = 2, I go two data points below 4819 (which are 3294 and 38999). The captured points for the 4819 data point are therefore 1889, 4819, 3294 and 38999, and I would like to take the median of these and store it in my output. I want to do this exercise for each group.
Similarly, for Group2, Month 4, I capture the previous 2 lag data points with reference to 123 (lead is zero, so nothing is captured ahead), and take the median of the 3 values in total.
I tried the same for one particular case with an ifelse condition, to see how it works.
df$output <- ifelse(lag==0 & lead==1, median(Sales,lead(Sales,1)),0)
The result was very surprising: R took the median of all the values in the column. Another problem is that even if it had worked, I would have had to write multiple ifelse conditions, so I am looking for a simpler solution.
I am not sure how to approach the problem and carry out the exercise for each group in R.
Below is the output i am trying to achieve.
Group,Date,Month,Sales,lag,lead,Output
Group1,42005,1,2503,0,2,3734
Group1,42036,2,3734,0,2,6631
Group1,42064,3,6631,2,3,4276.5
Group1,42095,4,8606,0,0,8606
Group1,42125,5,1889,0,2,3294
Group1,42156,6,4819,1,2,4056.5
Group1,42186,7,3294,1,0,4056.5
Group1,42217,8,38999,2,0,4819
Group1,42248,9,28372,1,0,33685.5
Group1,42278,10,25396,4,1,23244.5
Group1,42309,11,21093,1,0,23244.5
Group2,42339,1,9263,0,3,7961.5
Group2,42005,2,6660,1,3,9263
Group2,42036,3,28595,2,2,9263
Group2,42064,4,123,2,0,6660
Group2,42095,5,11855,3,3,11855
Group2,42125,6,15845,4,3,13850
Group2,42156,7,32331,2,2,15845
Group2,42186,8,3188,1,1,32331
Group2,42217,9,38161,4,0,15845
Any leads would be highly appreciated.
I am missing something; please guide me on how to solve this problem. If there is a particular function I need to use, please point me to it.
Thanks,
df$Output <- sapply(seq(nrow(df)),   # for each row number i in df
                    function(i)
                      # take the median of Sales from
                      # (current row - current lag value) to
                      # (current row + current lead value)
                      with(df, median(Sales[(i - lag[i]):(i + lead[i])])))
Data Used:
df <- data.table::fread("
Group,Date,Month,Sales,lag,lead
Group1,42005,1,2503,0,2
Group1,42036,2,3734,0,2
Group1,42064,3,6631,2,3
Group1,42095,4,8606,0,0
Group1,42125,5,1889,0,2
Group1,42156,6,4819,1,2
Group1,42186,7,3294,1,0
Group1,42217,8,38999,2,0
Group1,42248,9,28372,1,0
Group1,42278,10,25396,4,1
Group1,42309,11,21093,1,0
Group2,42339,1,9263,0,3
Group2,42005,2,6660,1,3
Group2,42036,3,28595,2,2
Group2,42064,4,123,2,0
Group2,42095,5,11855,3,3
Group2,42125,6,15845,4,3
Group2,42156,7,32331,2,2
Group2,42186,8,3188,1,1
Group2,42217,9,38161,4,0
")
dout <- data.table::fread("
Group,Date,Month,Sales,lag,lead,Output
Group1,42005,1,2503,0,2,3734
Group1,42036,2,3734,0,2,6631
Group1,42064,3,6631,2,3,4276.5
Group1,42095,4,8606,0,0,8606
Group1,42125,5,1889,0,2,3294
Group1,42156,6,4819,1,2,4056.5
Group1,42186,7,3294,1,0,4056.5
Group1,42217,8,38999,2,0,4819
Group1,42248,9,28372,1,0,33685.5
Group1,42278,10,25396,4,1,23244.5
Group1,42309,11,21093,1,0,23244.5
Group2,42339,1,9263,0,3,7961.5
Group2,42005,2,6660,1,3,9263
Group2,42036,3,28595,2,2,9263
Group2,42064,4,123,2,0,6660
Group2,42095,5,11855,3,3,11855
Group2,42125,6,15845,4,3,13850
Group2,42156,7,32331,2,2,15845
Group2,42186,8,3188,1,1,32331
Group2,42217,9,38161,4,0,15845
")
all.equal(df$Output, dout$Output)
# [1] TRUE
library(data.table)
setDT(df)[, i := sequence(.N)][
  , med := as.numeric(median(df$Sales[(i - lag):(i + lead)])), by = i][
  , i := NULL][]
Group Date Month Sales lag lead med
1: Group1 42005 1 2503 0 2 3734.0
2: Group1 42036 2 3734 0 2 6631.0
3: Group1 42064 3 6631 2 3 4276.5
4: Group1 42095 4 8606 0 0 8606.0
5: Group1 42125 5 1889 0 2 3294.0
6: Group1 42156 6 4819 1 2 4056.5
7: Group1 42186 7 3294 1 0 4056.5
8: Group1 42217 8 38999 2 0 4819.0
9: Group1 42248 9 28372 1 0 33685.5
10: Group1 42278 10 25396 4 1 23244.5
11: Group1 42309 11 21093 1 0 23244.5
12: Group2 42339 1 9263 0 3 7961.5
13: Group2 42005 2 6660 1 3 9263.0
14: Group2 42036 3 28595 2 2 9263.0
15: Group2 42064 4 123 2 0 6660.0
16: Group2 42095 5 11855 3 3 11855.0
17: Group2 42125 6 15845 4 3 13850.0
18: Group2 42156 7 32331 2 2 15845.0
19: Group2 42186 8 3188 1 1 32331.0
20: Group2 42217 9 38161 4 0 15845.0
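Note that both answers index Sales across the whole table; in this data the lag/lead windows happen never to cross a group boundary, which is why the check above succeeds. If the windows must be confined to each group, a per-group variant along the same lines would work (a dplyr sketch; it reproduces the same Output on this data):
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(Output = sapply(seq_len(n()), function(i)
    # median over rows (i - lag[i]) .. (i + lead[i]), within this Group only
    median(Sales[(i - lag[i]):(i + lead[i])]))) %>%
  ungroup()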

Working with repeated values in rows

I am working with a data frame of 46,216 observations where the units are homes and people; each home may have any number of members, like:
(screenshot of the data omitted)
and this for almost 18,000 more homes.
What I need to do is get the mean education years for every home, for which I guess I will need a variable holding the number of people in each home.
What I tried to do is:
num_peopl = by(df$person_number, df$home, max), i.e. for each home I take the highest person number as the total number of people who live there, but when I try to cbind this with the df I get:
"arguments imply differing number of rows: 46216, 17931"
It is as if it puts the number of persons in only one row per home and leaves the others empty.
How can I do this? Is there a function for it?
I think aggregate and a join may be what you're looking for. aggregate does the same thing you did, but returns a data frame, which (for me at least) is easier to work with.
Then I used dplyr's left_join, joining on the home numbers:
library(tidyverse)
df <- data.frame(home_number = c(1,1,1,2,2,3),
                 person_number = c(1,2,3,1,2,1),
                 age = c(20,21,1,54,50,30),
                 sex = c("m","f","f","m","f","f"),
                 salary = c(1000,890,NA,900,500,1200),
                 years_education = c(12,10,0,8,7,14))
df2 <- aggregate(df$person_number, by = list(df$home_number), max)
df_final <- df %>%
  left_join(df2, by = c("home_number" = "Group.1"))
df_final
home_number person_number age sex salary years_education x
1 1 1 20 m 1000 12 3
2 1 2 21 f 890 10 3
3 1 3 1 f NA 0 3
4 2 1 54 m 900 8 2
5 2 2 50 f 500 7 2
6 3 1 30 f 1200 14 1
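From there, the original goal, the mean education years per home, can be computed directly in one grouped pipeline without the separate join (a sketch; n_people and mean_education are illustrative column names):
df %>%
  group_by(home_number) %>%
  mutate(n_people = max(person_number),
         mean_education = mean(years_education, na.rm = TRUE)) %>%
  ungroup()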

Row data to binary columns while preserving the number of rows

This is similar to the question "R Convert row data to binary columns", but I want to preserve the number of rows.
How can I convert the row data to binary columns while preserving the number of rows?
Example
Input
myData <- data.frame(gender = c("man","women","child","women","women","women","man"),
                     age = c(22, 22, 0.33, 22, 22, 22, 111))
myData
gender age
1 man 22.00
2 women 22.00
3 child 0.33
4 women 22.00
5 women 22.00
6 women 22.00
7 man 111.00
How to get to this intended output?
gender age man women child
1 man 22.00 1 0 0
2 women 22.00 0 1 0
3 child 0.33 0 0 1
4 women 22.00 0 1 0
5 women 22.00 0 1 0
6 women 22.00 0 1 0
7 man 111.00 1 0 0
Perhaps a slightly easier solution without reliance on another package:
data.frame(myData, model.matrix(~gender+0, myData))
We can use dcast to do this
library(data.table)
dcast(setDT(myData), gender + age + seq_len(nrow(myData)) ~ gender,
      length)[, myData := NULL][]
Or use table from base R and cbind with the original dataset
cbind(myData, as.data.frame.matrix(table(1:nrow(myData), myData$gender)))
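For reference, model.matrix() prefixes each dummy column with the variable name (genderchild, genderman, genderwomen). A small cleanup sketch to recover the bare level names shown in the intended output:
out <- data.frame(myData, model.matrix(~ gender + 0, myData))
# strip the "gender" prefix from the dummy columns; the original
# "gender" column itself is left untouched (it has no trailing characters)
names(out) <- sub("^gender(.+)$", "\\1", names(out))
out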

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
Now I require the output as:
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
Here, the calculation goes as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (month 1) and February (month 2), then the result should be 1; otherwise it should display 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest the dplyr package, which offers convenient grouping of rows and mutating of columns in a data frame.
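A minimal setup sketch so the pipeline below is runnable, using the table from the question; the variable name data matches the one used in the answer:
library(dplyr)
data <- read.table(header = TRUE, text = "
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
")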
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or to no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments appropriately.
For more information, see the dplyr package documentation.

Row Differences in Dataframe by Group

My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame(group_id = rep(c(1,2,3), 3),
                date = sample(seq(as.Date("1970-01-01"), Sys.Date(), by = 1), 9, replace = F),
                logical_value = sample(c(T,F), 9, replace = T),
                integer = sample(1:100, 9, replace = T),
                float = runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that the diff function can find the sequential differences, which results in time-ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df, 2, diff), but I need it by group_id; applied to the whole frame, it would also produce unneeded differences across group boundaries.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group) {
  apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).
Using this edit, I get a list of data frames (one list entry per group).
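For concreteness, a sketch of the corrected helper: apply() first coerces the mixed-type data frame to a character matrix, which is what triggered the non-numeric-argument error above, whereas sapply() works column by column.
create_differences <- function(data_group) {
  # sapply keeps each column's own type; apply would coerce the
  # mixed-type data frame to a character matrix first
  sapply(data_group, diff)
}
diff_list <- lapply(split(df, df$group_id), create_differences)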
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
Given that you have millions of rows, I think you can move to data.table, which is well suited to by-group operations.
library(data.table)
DT <- as.data.table(df)
## this will order rows per group and per date
setkeyv(DT, c('group_id', 'date'))
## apply diff to every column, by group
DT[, lapply(.SD, diff), group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
It certainly won't be as quick as data.table, but below is an only slightly ugly base solution using aggregate:
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596
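For completeness, a tidyverse counterpart (a sketch assuming dplyr >= 1.1.0, whose reframe() allows multi-row per-group results):
library(dplyr)
# diff() shortens each column by one row per group; reframe() permits that
df %>%
  arrange(group_id, date) %>%
  group_by(group_id) %>%
  reframe(across(c(date, logical_value, integer, float), diff))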
