I'm trying to compute group standard deviations with aggregate(), but I'm either getting an error message I can't resolve or output I can't use. I'm including sample data; the goal is to run this on a larger data set, but I can't even get the aggregate call to work on these three columns.
dput(droplevels(controls2[1:20, 1:3]))
structure(list(Experiment = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Ceres- Clipping",
"FGI- Defoliation"), class = "factor"), Grain = c(489.9, 698.5,
430.6, 244.9, 476.5, 545.4, 570.2, 463.1, 285.1, 407.6, 244.9,
401.9, 126.3, 179.9, 382.7, 266, 653, 653, 606.6, 606.6), Environment = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("Morris.1", "St. Paul.1"), class = "factor")), row.names = c(3565L,
3566L, 3567L, 3568L, 3569L, 3570L, 3571L, 3572L, 3573L, 3574L,
3575L, 3576L, 3577L, 3578L, 3579L, 3580L, 2379L, 2380L, 2381L,
2382L), class = "data.frame")
controlSDs <- aggregate(x = controls2, by = list(controls2$Experiment, controls2$Environment), FUN = "sd")
I get an error message:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
However, the only column that I'm trying to perform sd() on, controls2$Grain, is numeric:
names(controls2)
[1] "Experiment" "Grain" "Environment"
class(controls2$Grain)
[1] "numeric"
I understand controls2$Environment and controls2$Experiment are factors, but I have run this command before with factors in by = list() and it has worked. I've also tried the following:
controlSDs <- aggregate(cbind(Experiment, Environment) ~ Grain, data = controls2, sd)
This does not return an error message; however, the values for controlSDs$Experiment and controlSDs$Environment have been replaced with 0s and NAs, so I cannot use them to combine the data set with a data frame of means calculated using a similar aggregate call.
head(controlSDs)
Grain Experiment Environment
1 0.0 0 0
2 30.0 NA NA
3 44.0 NA NA
4 44.3 NA NA
5 46.0 NA NA
6 48.0 NA NA
Any advice on how to get this aggregate SD calculation to work correctly would be much appreciated. I would be happy with a solution that simply allows me to calculate the SD of the Grain column, but ideally I could scale this up to a data set with 100+ columns that is entirely numeric aside from the Environment and Experiment columns. I've updated R and RStudio within the last two weeks. I'm still learning how to make reproducible questions, so please let me know if there's anything I can do to improve this question.
Are you looking for this? When you specify the formula, you need to put the numeric variable to the left of the ~:
#Code
controlSDs <- aggregate(Grain ~ ., data = controls2, FUN = sd)
Output:
controlSDs
Experiment Environment Grain
1 Ceres- Clipping Morris.1 154.67734
2 FGI- Defoliation St. Paul.1 26.78905
Based on your attempts, this can also work:
#Code2
controlSDs <- aggregate(Grain ~ Experiment + Environment, data = controls2, sd)
Same output.
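If you eventually need the SD of every numeric column (the 100+ column case you describe), the same formula idea scales; a sketch, assuming everything other than Experiment and Environment really is numeric (note the formula method drops rows with an NA in any column involved):
controlSDs <- aggregate(. ~ Experiment + Environment, data = controls2, FUN = sd)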
We can use dplyr:
library(dplyr)
controls2 %>%
group_by(Experiment, Environment) %>%
summarise(Grain = sd(Grain))
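To scale the dplyr approach to many numeric columns at once, a sketch assuming dplyr >= 1.0.0 (which provides across()):
library(dplyr)
controls2 %>%
  group_by(Experiment, Environment) %>%
  summarise(across(where(is.numeric), sd), .groups = "drop")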
I'm gradually getting used to recoding variables in R but I'm having a bit of trouble creating two new variables. For example, I have tried the following:
income2018$income2 <- dplyr::recode(income2018$income, '51' = 1L, '52' = 1L, '53' = 2L)
income2018$income3 <- dplyr::recode(income2018$income, '57' = 1L, '58' = 1L, '50' = 2L)
It doesn't look like the values are being correctly applied to the new variables.
Here is the SPSS syntax that I am attempting to recreate:
RECODE income (51,52=1)(53=2) into income2
RECODE income (57,58=1)(50=2) into income3
I'd be very grateful for any assistance.
Many thanks.
It looks like you might need to rearrange your code a little bit, but it's hard to tell without a reprex.
You might want to try:
income2018 <- income2018 %>%
  dplyr::mutate(income2 = dplyr::recode(income, `51` = 1L, `52` = 1L, `53` = 2L))
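Both new variables can then be created in one mutate() call. A sketch, assuming income is numeric and that, as with SPSS RECODE ... INTO, values not listed should become missing (drop .default if you would rather keep the original values):
library(dplyr)
income2018 <- income2018 %>%
  mutate(
    income2 = recode(income, `51` = 1L, `52` = 1L, `53` = 2L, .default = NA_integer_),
    income3 = recode(income, `57` = 1L, `58` = 1L, `50` = 2L, .default = NA_integer_)
  )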
Not sure if someone has answered this; I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three levels of my "PROG" factor ("Grad.2", "Grad.3", "Grad.H") so that they become a single level ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. I have not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. I will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ## combines the 3 Grad levels into one
filter(!is.na(PROG)) %>% ## drops the rows for all other levels
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
Slightly different approach: only take the factor levels you want, drop the PROG variable (because you want to treat them as a single group) and sum up all NUMBER values while grouping by all other variables. df is your data.
aggregate(formula = NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by, e.g. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk will make a new column in df named group.sum containing the group-wise sum for the groups defined by the group_by() call.
If you want to condense the data.frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do this, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum   # overwrite NUMBER with the group sum
df$group.sum <- NULL        # drop the helper column
df <- unique(df)            # keep one row per group
A side note: I wouldn't recommend doing the above chunk, because you lose information in your data; your data is tidier if you just keep the extra group.sum column.
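If you do want the condensed form, a summarise() call gets there directly without the unique() step; a sketch using the same grouping idea (df.condensed is just an illustrative name, and you can add or drop grouping variables as needed):
library(dplyr)
df.condensed <- df %>%
  group_by(YEAR, AGE, PROG) %>%
  summarise(NUMBER = sum(NUMBER))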
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp. You can relabel the existing levels of PROG directly; positions 4 to 8 of the level set (Grad.H, Grad2, Grad3, Grad2.Qual, Grad3.Qual) become "Grad" and everything else "Other":
temp$PROG2 <- temp$PROG
levels(temp$PROG2) <- c(rep("Other", 3), rep("Grad", 5), rep("Other", 12))
temp
I am trying to calculate Month over Month % Revenue change on data rows using R. For example my current data is:
Booking.Date Revenue Month
4/1/2018 3160 April
4/1/2018 12656 April
4/1/2018 5157 April
5/8/2018 12152 May
5/8/2018 2824 May
5/8/2018 4600 May
6/30/2018 6936 June
6/30/2018 17298 June
6/30/2018 9625 June
I want to make a dynamic function in R which calculates the month-over-month revenue change, MoM = ((Revenue_month2 - Revenue_month1) / Revenue_month1) * 100, for any new month.
The output should be similar to:
Month Revenue_MoM
April 3%
May -8%
June 50%
and so on.
I have a data.table solution; only the ordering needs to be fixed, by turning the month into a proper date column, but it should give you an idea. Please keep in mind that for the first month there's no way to calculate a growth rate. I used the logarithmic growth rate, which is in my opinion the best way, but you can easily switch to any other growth rate calculation.
library(data.table)
dt <- structure(list(Booking.Date = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("4/1/2018", "5/8/2018", "6/30/2018"
), class = "factor")
, Revenue = c(3160L, 12656L, 5157L, 12152L, 2824L, 4600L, 6936L, 17298L, 9625L)
, Month = structure(c(1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L), .Label = c("April", "June", "May"), class = "factor"))
, row.names = c(NA, -9L)
, class = c("data.table","data.frame"))
# Change the month column into date one
# Setting the locale, so that the months can be converted
Sys.setlocale("LC_TIME", "en_US.UTF-8")
dt[, `:=`(Month.Date = as.Date(paste0("2018-", Month, "-01"), tryFormats = "%Y-%B-%d"))]
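# Calculations, based on the logarithmic growth rate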
dt[,.(Sum.Revenue = sum(Revenue)), by = list(Month.Date)][, .(Month.Date
, Sum.Revenue
, Change.Revenue = log(Sum.Revenue) - log(shift(Sum.Revenue, n =1L, type = "lag"))
)]
# Calculations, based on the normal growth rate calculation
dt[,.(Sum.Revenue = sum(Revenue)), by = list(Month.Date)][, .(Month.Date
, Sum.Revenue
, Change.Revenue = (Sum.Revenue - shift(Sum.Revenue, n =1L, type = "lag"))/shift(Sum.Revenue, n =1L, type = "lag")
)]
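If you prefer the result ordered chronologically and expressed as a percentage, a sketch along the same lines (keyby sorts by Month.Date; monthly is just an illustrative name):
monthly <- dt[, .(Sum.Revenue = sum(Revenue)), keyby = Month.Date]
monthly[, Revenue_MoM := 100 * (Sum.Revenue - shift(Sum.Revenue)) / shift(Sum.Revenue)]
monthly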
I want to test homoscedasticity using the levene.test function from the lawstat package, specifically because I like the bootstrap option and the fact that it returns a list rather than the unmanageable output of car::leveneTest. From the help of lawstat::levene.test it is clear that NAs should be omitted from my dataset by default. Below I provide the original data.
testset.logcount<-c(6.86923171973098, 6.83122969386706, 7.30102999566398,7.54282542695918,6.40823996531185, 6.52891670027766, 6.61278385671974, 6.71933128698373, 6.96567197122011, 6.34242268082221, 6.60205999132796, 6.69897000433602, 6.6232492903979, 6.54157924394658, 6.43136376415899, 6.91381385238372,6.44715803134222, 6.30102999566398, 6.10037054511756, 6.7481880270062,NA, 4.89762709129044,5.26951294421792, 5.12385164096709, 5.11394335230684, 4.43136376415899, 5.73957234445009, 5.83250891270624, 5.3451776165427, 5.77887447200274, 5.38524868240322, 5.75127910398334, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
testset.treat<-structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("CTL","TRM"), class = "factor")
When I execute lawstat::levene.test(y = testset.logcount, group = testset.treat) I get the following error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
As far as I can tell, testset.treat clearly has two levels.
Also, both car::leveneTest(y = testset.logcount, group = testset.treat) and fligner.test(x = testset.logcount, g = testset.treat) run without any errors.
I could not find out why I got this particular error with lawstat::levene.test and I am hoping that somebody here can help me out.
I am running R 3.0.0 on a x86_64-w64-mingw32/x64 platform (Windows 7, 64 bit).
For the record, this behavior was created by a bug in the function's attempt to remove NA values. It was attempting to do this using the code:
y <- y[!is.na(y)]
group <- group[!is.na(y)]
which, if there are NA values in y, can go very wrong: y is shortened first, so the !is.na(y) index computed afterwards no longer lines up with group. In this particular case it wiped out the second factor level.
It should be an easy fix, once reported.
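For reference, a minimal sketch of the fix: compute the index once, before shortening y, so both vectors are subset consistently:
ok <- !is.na(y)
y <- y[ok]
group <- group[ok]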
After some playing around with levene.test(), I think the problem is the missing values.
test <- cbind(testset.logcount, testset.treat)  # cbind() coerces the factor to its integer codes
test <- test[complete.cases(test), ]            # remove NAs
levene.test(test[, 1], test[, 2])
modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median
data: test[, 1]
Test Statistic = 0.9072, p-value = 0.3487
And this does match car's leveneTest (df = 29), so it must automatically delete missing rows:
> leveneTest(y=testset.logcount,group=testset.treat)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)
group    1  0.9072 0.3487
        29
I also went through a similar process. The aim is to keep the second factor level from being wiped out. Below, 'y' is your numeric vector and 'g' is the grouping factor:
yg = na.omit(data.frame(y,g))
y = yg[,1]
g = yg[,2]
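With the NA-free vectors, the bootstrap version should then run as well; a sketch, assuming the argument is called bootstrap as in the lawstat documentation (check ?levene.test for your installed version):
lawstat::levene.test(y = y, group = g, bootstrap = TRUE)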
I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in, but I do not know how to implement this in data.table under R. I know that there are things such as na.locf in the zoo package and rolling joins in the data.table package. The examples that I have seen are too simplistic and do not help me.
Here is some fictitious data to orient you (as a dput ASCII text representation)
structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy",
"Timothy"), class = "factor"), patient.last.name = structure(c(3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones",
"Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455,
4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775,
2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087,
39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360,
39184, 39194, 39198, 39216), procedure.code = c(44750, 38995,
40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750,
44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87,
456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87,
456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413,
413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe",
"Mark"), class = "factor"), referring.doctor.last = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead",
"Wydell"), class = "factor"), referring.docotor.zip = c(15209,
15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA,
15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L,
NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio",
"cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name",
"patient.last.name", "medical.record.nr", "date.of.service",
"procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first",
"referring.doctor.last", "referring.docotor.zip", "some.other.stuff"
), row.names = c(NA, 14L), class = "data.frame")
The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words, the LOCF must only be applied within one patient, identified by the combination of patient.first.name, patient.last.name and medical.record.nr. Also note how some patients are missing the referring doctor on their very first visit, which means that some observations have to be carried backwards. To complicate matters, some patients change primary care physicians, so there may be one referring doctor earlier on and another one later on. The algorithm therefore needs to be aware of the date order of the rows with missing values.
In zoo's na.locf I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen would not work here because I cannot simply take out the rows with the missing referring.doctor information, since I would then lose date.of.service, procedure.code and so on. I would love your help in learning how R can fill in my missing data.
A more concise example would have been easier to answer. For example you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?
Since you already have NAs in the data that you wish to fill, it's not really a roll in data.table. A rolling join is more for when your data has no NA but you have another time series (for example) that joins to positions in between the data. (One efficiency advantage there is the very fact that you don't create NAs first which you then have to fill in a second step.) Or, in other words, in your question you just have one dataset; you aren't joining two.
So you do need na.locf as #Joshua suggested. I'm not aware of a function that fills NA forward and then the first value backwards, though.
In data.table, to use na.locf by group it's just :
require(data.table)
require(zoo)
DT[, doctor := na.locf(doctor), by = patient]
which has the efficiency advantages of fast aggregation and update by reference. You would have to write a new small function on top of na.locf to roll the first non NA backwards.
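For instance, a small helper along those lines could look like this (a sketch; na_locf_both is a made-up name, and DT, doctor and patient are the placeholder names used above):
library(zoo)
na_locf_both <- function(x) {
  x <- na.locf(x, na.rm = FALSE)              # carry the last observation forward
  na.locf(x, na.rm = FALSE, fromLast = TRUE)  # then carry the first non-NA backwards
}
DT[, doctor := na_locf_both(doctor), by = patient]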
Ensure the data is sorted by patient then date, first. Then the above will cope with changes in doctor over time, since by maintains the order of rows within each group.
Hope that gives you some hints.
#MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.
In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.
setDT(bill)
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE),
by=list(patient.last.name, patient.first.name, medical.record.nr)]
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE,fromLast=TRUE),
by=list(patient.last.name, patient.first.name, medical.record.nr)]
Then do something similar for referring.doctor.first
A few pointers:
The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.
One must use the na.rm=FALSE argument. If one does not, a patient who is missing the referring physician on their very first visit will have that NA removed, and the vector of new values (existing + carried forward) will be one element shorter than the number of rows. The shortened vector is then recycled, everything gets shifted up, and the last row receives the first element of the vector as it is recycled. In other words, a big mess. And worst of all, you will only see it sometimes.
Use fromLast=TRUE to run through the column again. That fills in the NAs that precede any data: instead of last observation carried forward (LOCF), zoo then uses next observation carried backward (NOCB). Happiness: you have now filled in the missing data in a way that is correct for most circumstances.
You can pass multiple := per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]
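Applied to this problem, that tip lets you fill both referring-doctor columns in a single grouped call; a sketch along the lines of the code above:
bill[, `:=`(
  referring.doctor.first = na.locf(na.locf(referring.doctor.first, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE),
  referring.doctor.last  = na.locf(na.locf(referring.doctor.last,  na.rm = FALSE), na.rm = FALSE, fromLast = TRUE)
), by = list(patient.last.name, patient.first.name, medical.record.nr)]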