I am fitting a linear model to this data:
data <- data.frame(Student_ID =c(1,1,1,2,2,3,3,3,3,3,4,4,4,5,6,6,7,7,7,8,8),
Years_Attended = c(1991,1992,1995,1992,1993,1991,1992,1993,1994,1995,1993,1994,1995,1995,1993,1995,1990,1995,2000,1995,1996),
Class = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C"),
marks = c(50,55,46,44,60,66,67,80,91,90,70,75,76,77,77,82,89,88,88,64,65))
The purpose is to create a new column that determines change in marks. I call this column marks.change and I fit the model as follows:
data2 <- data %>% group_by(Student_ID) %>% summarise(
Good.marks = length(marks[!is.na(marks)]),
marks.change = ifelse(Good.marks>1,
summary(lm(marks ~ Years_Attended))$coefficients[2, 1], 0),
Student_ID = unique(Student_ID),
Class = unique(Class),
)
This code works fine. However, instead of considering all the years at once, I would like to fit the model above (i.e., the part where I say “marks.change =…”) for every interval between consecutive years and then average the results. That is, I would like to fit the model between 1991 and 1992 only, then between 1992 and 1993, then between 1993 and 1994, and so on up to the final year, and then put the average of these slopes in a new column called marks.change.part2.
Is there an easy way to automate this?
You can simplify your existing code a bit:
library(dplyr)
data %>% group_by(Student_ID, Class) %>% summarise(
Good.marks = sum(!is.na(marks)),
marks.change = ifelse(Good.marks>1,
summary(lm(marks ~ Years_Attended))$coefficients[2, 1], 0),
)
# A tibble: 8 x 4
# Groups: Student_ID [8]
Student_ID Class Good.marks marks.change
<dbl> <chr> <int> <dbl>
1 1 A 3 -1.46
2 2 A 2 16.
3 3 A 5 7.2
4 4 B 3 3.
5 5 B 1 0
6 6 B 2 2.50
7 7 C 3 -0.1
8 8 C 2 1.00
Now for your actual question: if I understand you correctly, perhaps you want the following. A linear model fitted to just two points is nothing more than the slope between them, which you can calculate directly with simple vector arithmetic.
data %>% group_by(Student_ID, Class) %>% summarise(
Good.marks = sum(!is.na(marks)),
marks.change = ifelse(Good.marks>1,
summary(lm(marks ~ Years_Attended))$coefficients[2, 1], 0),
marks.change.part2 = ifelse(Good.marks>1, mean(diff(marks)/diff(Years_Attended)), 0))
# A tibble: 8 x 5
# Groups: Student_ID [8]
Student_ID Class Good.marks marks.change marks.change.part2
<dbl> <chr> <int> <dbl> <dbl>
1 1 A 3 -1.46 1
2 2 A 2 16. 16
3 3 A 5 7.2 6
4 4 B 3 3. 3
5 5 B 1 0 0
6 6 B 2 2.50 2.5
7 7 C 3 -0.1 -0.1
8 8 C 2 1.00 1
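As a sanity check on the claim that a two-point linear model is just the slope between the two points, here is a minimal base R sketch for Student 1 from the question. The names `years` and `pair_slopes` are introduced only for this illustration.

```r
# Student 1 from the question
marks <- c(50, 55, 46)
years <- c(1991, 1992, 1995)

# Fit lm() separately on each pair of consecutive observations
pair_slopes <- sapply(seq_len(length(marks) - 1), function(k) {
  idx <- c(k, k + 1)
  unname(coef(lm(marks[idx] ~ years[idx]))[2])
})

# Average of the pairwise lm() slopes ...
mean(pair_slopes)
# ... agrees with the vectorised shortcut
mean(diff(marks) / diff(years))
```

Both expressions should give the marks.change.part2 value shown for Student 1, so the diff() version can stand in for the repeated model fits.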
For example, I want to classify R1 based on R2.
R1 is like
# A tibble: 5 x 2
lon lat
<dbl> <dbl>
1 1 2
2 3 5
3 6 8
4 5 10
5 3 2
and R2 is like
# A tibble: 3 x 3
lon lat place
<dbl> <dbl> <chr>
1 1 2 A
2 3 6 B
3 5 8 C
R2 is like a standard. I want to find the corresponding place for my observations in R1. Suppose the 1st place in R1 is graded like:
scores of A: (1-1)^2 + (2-2)^2 = 0
scores of B: (1-3)^2 + (2-6)^2 = 20
scores of C: (1-5)^2 + (2-8)^2 = 52
If the score for a place is smaller than 3, we classify the observation into that place.
The final result should be like this
# A tibble: 5 x 3
lon lat place
<dbl> <dbl> <chr>
1 1 2 A
2 3 5 B
3 6 8 C
4 5 10 NA
5 3 2 NA
There might be a neater way to do this with some purrr mapping, but using a couple of loops instead could get you the desired results:
library(tidyverse)
## Create R1 and R2 as tibbles, with place as a row name
R1 <- tribble(~lon, ~lat,
1,2,
3,5,
6,8,
5,10,
3,2)
R2 <- tribble(~lon, ~lat,~place,
1,2,"A",
3,6,"B",
5,8,"C") %>% column_to_rownames(var = "place")
## Create a results tibble
results <- R1 %>% mutate(A = NaN, B = NaN, C = NaN, match = NA_character_)
## Function to calculate place scores
place_scores <- function(vec){
apply(R2,1,function(x) x-vec) %>%
apply(.,2,function(x) x^2) %>%
colSums()
}
## Run function in a loop for each row in R1
for(i in 1:nrow(R1)){
res <- place_scores(as.numeric(R1[i,]))
results[i,3:5] <- res
}
## Run another loop to match the column with the lowest score and < 3
for(i in 1:nrow(results)){
match <- ifelse(any( results[i,3:5] < 3), colnames(results[,3:5])[which.min(as.numeric(results[i,3:5]))], NA)
results$match[i] <- match
}
results
# A tibble: 5 x 6
lon lat A B C match
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 0 20 52 A
2 3 5 13 1 13 B
3 6 8 61 13 1 C
4 5 10 80 20 4 NA
5 3 2 4 16 40 NA
I also came up with a way to do this using a for loop (here R2 still has its place column):
places = R2$place
R1$class = NA_character_  # default when no place scores below 3
for (i in seq_len(nrow(R1)))
{
dist = rep(0, nrow(R2))
for (j in seq_len(nrow(R2)))
{
dist[j] = (R1$lon[i] - R2$lon[j])^2 + (R1$lat[i] - R2$lat[j])^2
}
if (any(dist < 3)) R1$class[i] = places[which.min(dist)]
}
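The scoring rule above can also be written without explicit loops. The following is only a sketch in base R, assuming R2 keeps its place column and that a score strictly below 3 counts as a match; the names `d2` and `nearest` are made up for this example.

```r
R1 <- data.frame(lon = c(1, 3, 6, 5, 3), lat = c(2, 5, 8, 10, 2))
R2 <- data.frame(lon = c(1, 3, 5), lat = c(2, 6, 8),
                 place = c("A", "B", "C"), stringsAsFactors = FALSE)

# Squared distance from every R1 row (rows) to every R2 row (columns)
d2 <- outer(R1$lon, R2$lon, "-")^2 + outer(R1$lat, R2$lat, "-")^2

# Index of the closest place for each observation
nearest <- apply(d2, 1, which.min)

# Keep the closest place only when its score is below the threshold of 3
R1$place <- ifelse(apply(d2, 1, min) < 3, R2$place[nearest], NA)
R1
```

The matrix `d2` holds the same scores as the A, B and C columns in the results table, so the final place column should match the desired output from the question.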
I need to summarize a data.frame across multiple columns in a generic way:
the first summarize operation is easy, e.g. a simple median, and is straightforward;
the second summarize then includes a condition on another column, e.g. taking the value where there is a minimum (by group) in another column:
set.seed(4)
myDF = data.frame(i = rep(1:3, each=3),
j = rnorm(9),
a = sample.int(9),
b = sample.int(9),
c = sample.int(9),
d = 'foo')
# i j a b c d
# 1 1 0.2167549 4 5 5 foo
# 2 1 -0.5424926 7 7 4 foo
# 3 1 0.8911446 3 9 1 foo
# 4 2 0.5959806 8 6 8 foo
# 5 2 1.6356180 6 8 3 foo
# 6 2 0.6892754 1 4 6 foo
# 7 3 -1.2812466 9 1 7 foo
# 8 3 -0.2131445 5 2 2 foo
# 9 3 1.8965399 2 3 9 foo
myDF %>% group_by(i) %>% summarize(across(where(is.numeric), median, .names="med_{col}"),
best_a = a[[which.min(j)]],
best_b = b[[which.min(j)]],
best_c = c[[which.min(j)]])
# # A tibble: 3 x 8
# i med_j med_a med_b med_c best_a best_b best_c
# * <int> <dbl> <int> <int> <int> <int> <int> <int>
# 1 1 0.217 4 7 4 7 7 4
# 2 2 0.689 6 6 6 8 6 8
# 3 3 -0.213 5 2 7 9 1 7
How can I define this second summarize operation in a generic way (i.e., not manually as done above)?
Hence I would need something like this (which obviously does not work as j is not recognized):
myfns = list(med = ~median(.),
best = ~.[[which.min(j)]])
myDF %>% group_by(i) %>% summarize(across(where(is.numeric), myfns, .names="{fn}_{col}"))
# Error: Problem with `summarise()` input `..1`.
# x object 'j' not found
# ℹ Input `..1` is `across(where(is.numeric), myfns, .names = "{fn}_{col}")`.
# ℹ The error occurred in group 1: i = 1.
Use another across to get the corresponding values in columns a:c where j is at its minimum.
library(dplyr)
myDF %>%
group_by(i) %>%
summarize(across(where(is.numeric), median, .names="med_{col}"),
across(a:c, ~.[which.min(j)],.names = 'best_{col}'))
# i med_j med_a med_b med_c best_a best_b best_c
#* <int> <dbl> <int> <int> <int> <int> <int> <int>
#1 1 0.217 4 7 4 7 7 4
#2 2 0.689 6 6 6 8 6 8
#3 3 -0.213 5 2 7 9 1 7
To do it in the same across statement:
myDF %>%
group_by(i) %>%
summarize(across(where(is.numeric), list(med = median,
best = ~.[which.min(j)]),
.names="{fn}_{col}"))
I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, the dplyr library is perfect and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then summarising the grouped data frame:
library(dplyr)
df2 = df1 %>%
dplyr::group_by(subject, part) %>%
dplyr::summarise(
sad_mean = mean(sad, na.rm = TRUE),
na_count = sum(is.na(sad)) / n() * 100  # percentage of missing rows
)
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part, you can calculate the mean of sad and the share of NA values using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table:
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
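The reason mean(is.na(sad)) gives the share of missing values is that mean() treats TRUE as 1 and FALSE as 0. Here is a minimal base R sketch of the same idea on the question's data, using tapply() instead of dplyr or data.table; the name `pct` is only illustrative.

```r
subject <- rep(1:2, each = 16)
part <- rep(rep(0:3, each = 4), times = 2)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,
         NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject, part, sad)

# is.na() yields TRUE/FALSE; the mean of a logical vector is the share of TRUE
mean(is.na(c(NA, 2, 2, 1))) * 100

# Same computation per subject (rows) and part (columns)
pct <- tapply(df1$sad, list(df1$subject, df1$part),
              function(x) mean(is.na(x)) * 100)
pct
```

The rows of `pct` should line up with the perc_missing values above, read subject by subject.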
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And to combine this with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2, by = c("subject", "part"))
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4
Let me first give you an idea of what the data looks like:
Customer Value Module SubModule ModuleTF month department newCust
1 5 M1 SM1 1 1 DEP1 0
1 3 M1 SM1 1 2 DEP1 0
1 8 M1 SM1 1 3 DEP1 0
1 4 M2 SM1 1 1 DEP2 0
1 5 M2 SM2 1 1 DEP2 0
1 45 A5 null 0 1 DEP2 0
2
...
What I would like to do is calculate the slope of Value over month and store it in a new column of df. The problem is that it needs to be calculated for every module, submodule, and department. It should not be calculated if newCust = 0.
Also, sometimes the value for a given month is null and therefore not present in the dataset. I would like these null values to be included, as they obviously affect the slope. Moreover, modules sometimes do not have a submodule, and the calculation should be done in this case as well. Would it be necessary to enter those null values so that all modules and submodules have an equal number of entries?
I would like the outcome to look something like this:
Customer Value Module SubModule ModuleTF month department newCust slope
1 5 M1 SM1 1 1 DEP1 0 1.2
1 3 M1 SM1 1 2 DEP1 0 1.2
1 8 M1 SM1 1 3 DEP1 0 1.2
1 4 M2 SM1 1 1 DEP2 0 1.35
1 5 M2 SM2 1 1 DEP2 0 1.11
1 45 A5 null 0 1 DEP2 0 0.23
2
...
Any help will be more than appreciated!
Thanks!
What you can do is use dplyr's do() to fit linear models with the lm function separately for each group.
Taking the necessary data from your example, one could do
library(dplyr) # for the data munging and do() (per-group modelling)
# create some example data
df <- tibble(
customer = rep(1, 6),
value = c(5, 3, 8, 4, 5, 45),
month = c(1, 2, 3, 1, 2, 3),
departement = rep(c("Dep1", "Dep2"), each = 3)
)
# look at the data
df
#> # A tibble: 6 x 4
#> customer value month departement
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 5 1 Dep1
#> 2 1 3 2 Dep1
#> 3 1 8 3 Dep1
#> 4 1 4 1 Dep2
#> 5 1 5 2 Dep2
#> 6 1 45 3 Dep2
# create a linear model per group
df %>%
group_by(customer, departement) %>%
do(mod_lin = lm(value~month, data = .)) %>%
mutate(intercept = mod_lin$coefficients[1],
slope = mod_lin$coefficients[2])
#> Source: local data frame [2 x 5]
#> Groups: <by row>
#>
#> # A tibble: 2 x 5
#> customer departement mod_lin intercept slope
#> <dbl> <chr> <list> <dbl> <dbl>
#> 1 1 Dep1 <S3: lm> 2.333333 1.5
#> 2 1 Dep2 <S3: lm> -23.000000 20.5
If you want to learn more about the individual parts of the code, search for dplyr and R piping. The dplyr documentation has very good explanations.
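The question asks for the slope as a new column on every row rather than one row per group. The following is only a sketch of one way to do that in base R, assuming the same example data as above; the helper name `slopes` and the key built with paste() are choices made for this illustration.

```r
df <- data.frame(
  customer = rep(1, 6),
  value = c(5, 3, 8, 4, 5, 45),
  month = c(1, 2, 3, 1, 2, 3),
  departement = rep(c("Dep1", "Dep2"), each = 3),
  stringsAsFactors = FALSE
)

# One slope per customer/department group; group names look like "1.Dep1"
slopes <- sapply(
  split(df, list(df$customer, df$departement), drop = TRUE),
  function(g) unname(coef(lm(value ~ month, data = g))[2])
)

# Look up each row's group slope and attach it as a new column
df$slope <- unname(slopes[paste(df$customer, df$departement, sep = ".")])
df
```

More grouping columns (module, submodule) could be added to both the split() list and the paste() key in the same way. Months that are absent from the data are not filled in by this sketch.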
I'd like to calculate relative changes of measured variables in a data.frame by group with dplyr.
The changes are with respect to a first baseline value at time==0.
I can easily do this in the following example:
# with this easy example it works
df.easy <- data.frame( id =c(1,1,1,2,2,2)
,time=c(0,1,2,0,1,2)
,meas=c(5,6,9,4,5,6))
df.easy %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative =
meas/meas[time==0])
# Source: local data frame [6 x 4]
# Groups: id [2]
#
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1.00
# 2 1 1 6 1.20
# 3 1 2 9 1.80
# 4 2 0 4 1.00
# 5 2 1 5 1.25
# 6 2 2 6 1.50
However, when there are ids with no measurement at time==0, this doesn't work.
This is similar to an existing question, but I'd like to get NA as the result instead of simply taking the first occurrence as the baseline.
# how to output NA in case there are id's with no measurement at time==0?
df <- data.frame( id =c(1,1,1,2,2,2,3,3)
,time=c(0,1,2,0,1,2,1,2)
,meas=c(5,6,9,4,5,6,5,6))
# same approach now gives an error:
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = meas/meas[time==0])
# Error in mutate_impl(.data, dots) :
# incompatible size (0), expecting 2 (the group size) or 1
Let's try to return NA in case no measurement at time==0 was taken, using ifelse
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas/meas[time==0], NA) )
# Source: local data frame [8 x 4]
# Groups: id [3]
#
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1
# 2 1 1 6 1
# 3 1 2 9 1
# 4 2 0 4 1
# 5 2 1 5 1
# 6 2 2 6 1
# 7 3 1 5 NA
# 8 3 2 6 NA
Wait, why is the relative measurement above always 1?
identical(
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas, NA) ),
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas[time==0], NA) )
)
# TRUE
It seems that the ifelse prevents meas from picking the current row and instead always selects the subset where time==0.
How can I calculate relative changes when there are IDs with no baseline measurement?
The issue is in the ifelse(). According to the ifelse documentation, it returns "A vector of the same length...as test". Since any(time==0) has length 1 for each group (TRUE or FALSE), only the first element of meas / meas[time==0] was selected. That single value was then recycled to fill each group.
To fix this, repeat the any() result so that it has the length of the group. I believe this should work:
df %>% dplyr::group_by(id) %>%
dplyr::mutate(meas.relative = ifelse(rep(any(time==0),times = n()), meas/meas[time==0], NA) )
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1.00
# 2 1 1 6 1.20
# 3 1 2 9 1.80
# 4 2 0 4 1.00
# 5 2 1 5 1.25
# 6 2 2 6 1.50
# 7 3 1 5 NA
# 8 3 2 6 NA
To see how this was working incorrectly in your case try:
ifelse(TRUE,c(1,2,3),NA)
#[1] 1
Edit: A data.table solution with the same concept:
library(data.table)
as.data.table(df)[, meas.rel := ifelse(rep(any(time==0), .N), meas/meas[time==0], NA_real_)
,by=id]
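Because the condition any(time==0) is a single TRUE or FALSE per group, a plain if/else avoids the ifelse() length pitfall altogether. This is only a base R sketch of that idea using split() and unsplit(); the names `by_id` and `rel` are introduced here for illustration.

```r
df <- data.frame(id   = c(1, 1, 1, 2, 2, 2, 3, 3),
                 time = c(0, 1, 2, 0, 1, 2, 1, 2),
                 meas = c(5, 6, 9, 4, 5, 6, 5, 6))

# One data frame per id
by_id <- split(df, df$id)

# if/else is evaluated once per group, so the whole vector is returned
rel <- lapply(by_id, function(g) {
  if (any(g$time == 0)) g$meas / g$meas[g$time == 0]
  else rep(NA_real_, nrow(g))
})

# Put the per-group results back in the original row order
df$meas.relative <- unsplit(rel, df$id)
df
```

Like the answer above, this assumes at most one measurement at time==0 per id.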