I have a script in R to calculate body condition residuals. I would like to apply this code to each column, where each column corresponds to a specific category of individual.
For example, I would run this code to calculate the body condition residuals of all individuals that are in category 1:
1) Select rows of interest
Data1 = RawData %>% select(ID, temperature, Bodysize1, mass1, year) %>% filter(temperature %in% c(20:29) & Bodysize1 %in% c(20:100) & mass1 %in% c(15:40))
2) Create a new model with the selected data
Model1 =lmer(log(mass1) ~ log(Bodysize1) + temperature + (1|year), data = Data1)
3) Extract residuals and add ID to the residuals
ResModel1 = resid(Model1)
ID=Data1$ID
Res1 = data.frame(ResModel1, ID)
4) Add residuals to my RawData
RawData2.0 = merge(RawData, Res1, by = c("ID"), all.x = T)
In order to avoid rerunning this code and manually changing all the 1s to 2s, then all the 2s to 3s, etc., is there a way to make this command automatic with loops or the apply family?
My data
ID TEMPERATURE BODYSIZE1 MASS1 BODYSIZE2 MASS2 YEAR
81-012 0.03830645 200 1450 205 1425 1981
84-069 0.26923078 200 1473 205 1498 1984
84-134 0.32692307 209 1448 195 1323 1984
84-145 0.27884614 197 1373 197 1498 1984
84-190 0.31129807 191 1248 195 1323 1984
85-155 0.33056709 198 1637 229 1988 1985
Thanks in advance
Without example data it is tough to say if this will work, but creating a function may simplify your workflow:
library(tidyverse)
library(lme4)  # needed for lmer()

get_resid <- function(df, filters) {
  df_to_model <- df %>% filter({{ filters }})
  # keep the rows that do not go into the model so they can be added back at the end
  df_to_keep <- df %>% filter(!({{ filters }}))
  Data1 <- df_to_model %>%
    select(ID, temperature, Bodysize1, mass1, year)
  Model1 <- lmer(log(mass1) ~ log(Bodysize1) + temperature + (1|year), data = Data1)
  ResModel1 <- resid(Model1)
  ID <- Data1$ID
  Res1 <- data.frame(ResModel1, ID)
  Res1 %>%
    bind_rows(df_to_keep)
}
Then you may use this function in your pipes:
RawData %>%
get_resid(temperature %in% c(20:29) & Bodysize1 %in% c(20:100) & mass1 %in% c(15:40))
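To address the "avoid changing every 1 to 2 to 3" part of the question, here is a minimal sketch, assuming lme4 is available, the lowercase column names used in the question's code (BodysizeN, massN, temperature, year), and the same filter ranges; get_resid_n is a hypothetical helper that parameterizes the column suffix and loops with lapply():

library(lme4)
library(dplyr)

get_resid_n <- function(df, n) {
  size_col <- paste0("Bodysize", n)   # e.g. "Bodysize1"
  mass_col <- paste0("mass", n)       # e.g. "mass1"
  dat <- df %>%
    filter(temperature %in% 20:29,
           .data[[size_col]] %in% 20:100,
           .data[[mass_col]] %in% 15:40)
  # build the model formula from the column names for this category
  f <- as.formula(paste0("log(", mass_col, ") ~ log(", size_col, ") + temperature + (1 | year)"))
  mod <- lmer(f, data = dat)
  data.frame(ID = dat$ID, resid = resid(mod))
}

# one residual data frame per numbered category (here categories 1 and 2)
res_list <- lapply(1:2, function(n) get_resid_n(RawData, n))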
You might try to i) create a tibble and, in its first column, list all dependent variables as strings,
ii) list your models of independent variables in the second column,
iii) build a formula in the third column,
iv) run your model in the fourth column:
df <- tibble(dep = paste0("log(var", seq(1, 10, 1), ")"),
             x   = "~ your_x_vars") %>%
  mutate(formula = map(paste0(dep, x), as.formula)) %>%   # as.formula() is not vectorised, so map over the strings
  mutate(reg = map(formula, ~ lm(.x, data = your_data)))  # your_data = the data frame that holds var1...var10
Then you can easily extract the residuals.
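For example, a minimal sketch continuing from the df tibble built above (resid is just an illustrative column name):

# add a list-column with one residual vector per fitted model
df <- df %>%
  mutate(resid = map(reg, residuals))

df$resid[[1]]  # residuals of the first model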
I have a large dataframe with multiple columns (about 150).
There is a range of columns (DX1, DX2, ..., DX30) which are diagnosis codes (the codes are numbers, but they are categorical variables that correspond to a medical diagnosis in the ICD-9 coding system).
I have working code to search a single column, but need to search all 30 columns to see if any of the columns contain a code within the specified range (DXrange).
The core dataframe looks like:
Case DX1 DX2 DX3 DX4...DX30
1 123 345 567 99 12
2 234 345 NA NA NA
3 456 567 789 345 34
Here is the working code:
## Defines a range of codes to search for
DXrange <- factor(41000:41091, levels = levels(core$DX1))
## Search for the DXrange codes in column DX1.
core$IndexEvent <- core$DX1 %in% DXrange & substr(core$DX1, 5, 5) != 2
## What is the frequency of the IndexEvent?
cat("Frequency of IndexEvent : \n"); table(core$IndexEvent)
The working code is adapted from "Calculating Nationwide Readmissions Database (NRD) Variances, Report # 2017-01"
I could run this for each DX column and then sum them for a final IndexEvent total, but this is not very efficient.
I would first normalize my data before searching the codes, as in the following example:
set.seed(314)
df <- data.frame(id = 1:5,
                 DX1 = sample(1:10, 5),
                 DX2 = sample(1:10, 5),
                 DX3 = sample(1:10, 5))
require(dplyr)
require(tidyr)
df %>%
gather(key,value,-id) %>%
filter(value %in% 1:2)
or with just base R
df.long <- do.call(rbind,lapply(df[,2:4],function(x) data.frame(id = df$id, DX = x)))
df.long[df.long$DX %in% 1:2, ]
We could use filter_at with any_vars
df %>%
filter_at(vars(matches("DX\\d+")), any_vars(. %in% DXrange))
where
DXrange <- 41000:41091
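Note that filter_at() with any_vars() is superseded in recent dplyr versions; a sketch of the equivalent with if_any(), assuming the same DXrange and DX1...DX30 column names:

library(dplyr)

DXrange <- 41000:41091

# keep rows where any DX column contains a code in DXrange
df %>%
  filter(if_any(matches("DX\\d+"), ~ .x %in% DXrange))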
I am writing a generic function in R which takes a data frame and a column name and returns a clean data frame without outliers:
cooks_dist <- function(dataframe, column){
  dataframe <- dataframe %>% select_if(dataframe, is.numeric)
  mod <- lm(column ~ ., data = dataframe)
  cooksd <- cooks.distance(mod)
  influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm = T))]) # influential row numbers
  final <- dataframe[-influential, ]
  return(final)
}
But when I run this function, it says: Error: Can't convert a list to function
Data can be found at
http://ucanalytics.com/blogs/wp-content/uploads/2016/09/Regression-Clean-Data.csv
The error originates from dplyr::select_if(). I believe you want a subset of all numeric columns, so you could alternatively create that subset with sapply(). Note: as your lm() line produced errors, I've inserted the minimal model instead.
So I think you want this:
cooks_dist <- function(dataframe, column){
  dataframe <- dataframe[, sapply(dataframe, is.numeric)]
  mod <- lm(dataframe[, column] ~ 1, data = dataframe)
  cooksd <- cooks.distance(mod)
  influential <- as.numeric(names(cooksd)[(cooksd > 4 * mean(cooksd, na.rm = TRUE))])
  final <- dataframe[-influential, ]
  return(final)
}
df1 <- cooks_dist(df1, 4)
Yields:
> head(df1)
X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
2 2 2 8294 8186 12694 1461 1752 210 3982000
3 3 3 11001 14399 16991 1340 1609 720 5401000
4 4 4 8301 11188 12289 1451 1748 620 5373000
5 5 5 10510 12629 13921 1770 2111 450 4662000
7 7 7 13153 11869 17811 1542 1858 1030 7224000
8 8 8 5882 9948 13315 1261 1507 1020 3772000
I used this code, with a threshold for Cook's distance of 4/n:
orig.mod <- lm(Outcome ~ Exposure, data=origdf)
origdf$cooksd <- cooks.distance(orig.mod)
origdf$cookyn <- ifelse(origdf$cooksd < 4/nrow(origdf), "keep", "no")
minus.df <- subset(origdf, cookyn == "keep")
newmod.minuscooks <- lm(Outcome ~ Exposure, data=minus.df)
I'm new to R and what I want to do is something very simple, but I need help.
I have a database where Spot.number is the "name" of a protein, grupo is the group (I or II), and APF is a fluorescent reading.
I want to run a Student's t-test for each protein, comparing groups I and II, in a loop.
In my example there is only 1 protein (147), but in my real database I have 444 proteins.
Starting with some fake data:
set.seed(0)
Spot.number <- rep(147:149, each=10)
grupo <- rep(rep(1:2, each=5), 3)
APF <- rnorm(30)
gel <- data.frame(Spot.number, grupo, APF)
> head(gel)
Spot.number grupo APF
1 147 1 2.1780699
2 147 1 -0.2609347
3 147 1 -1.6125236
4 147 1 1.7863384
5 147 1 2.0325473
6 147 2 0.6261739
You can use lapply to loop through the subsets of gel, split by the Spot.number:
tests <- lapply(split(gel, gel$Spot.number), function(spot) t.test(APF ~ grupo, spot))
or just
tests <- by(gel, gel$Spot.number, function(spot) t.test(APF ~ grupo, spot))
You can then move on to e.g. taking only the p values:
sapply(tests, "[[", "p.value")
# 147 148 149
#0.2941609 0.9723856 0.5726007
or confidence interval
sapply(tests, "[[", "conf.int")
# 147 148 149
# [1,] -0.985218 -1.033815 -0.8748502
# [2,] 2.712395 1.066340 1.4240488
And the resulting vector or matrix will already have the Spot.number as names which can be very helpful.
You can perform a t.test within each group using dplyr and my broom package. If your data is stored in a data frame called dat, you would do:
library(dplyr)
library(broom)
results <- dat %>%
group_by(Spot.number) %>%
do(tidy(t.test(APF ~ grupo, .)))
This works by performing t.test(APF ~ grupo, .) on each group defined by Spot.number. The tidy function from broom then turns it into a one-row data frame so that it can be recombined. The results data frame will then contain one row per protein (Spot.number) with columns including estimate, statistic, and p.value.
See this vignette for more on the combination of dplyr and broom.
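In current dplyr, do() is superseded; a sketch of the same computation (assuming the same dat data frame) with nest() and map():

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

results <- dat %>%
  nest(data = -Spot.number) %>%                                     # one row per protein
  mutate(test = map(data, ~ tidy(t.test(APF ~ grupo, data = .x)))) %>%
  unnest(test) %>%
  select(-data)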
I have the following code:
library(dplyr)
library(quantmod)
# inflation data
getSymbols("CPIAUCSL", src='FRED')
avg.cpi <- apply.yearly(CPIAUCSL, mean)
cf <- avg.cpi/as.numeric(avg.cpi['1991']) # using 1991 as the base year
cf <- as.data.frame(cf)
cf$year <- rownames(cf)
cf <- tail(cf, 25)
rownames(cf) <- NULL
cf$year <- lapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1)))
rm(CPIAUCSL)
# end of inflation data get
tmp <- data.frame(year=c(rep(1991,2), rep(1992,2)), price=c(12.03, 12.98, 14.05, 14.58))
tmp %>% mutate(infl.price = price / cf[cf$year == year, ]$CPIAUCSL)
I want to get the following result:
year price
1991 12.03
1991 12.98
1992 13.64
1992 14.16
But I'm getting a warning:
Warning message:
In cf$year == tmp$year :
longer object length is not a multiple of shorter object length
And with %in% it produces an incorrect result.
I think it might be easier to join the CPIAUCSL column in cf into tmp before you try to mutate:
cf$year = as.numeric(cf$year)
tmp = tmp %>% inner_join(cf, by = "year") %>% mutate(infl.price = price / CPIAUCSL)
Your cf structure is a list of lists, which is unfriendly. It would have been nicer to have
cf$year <- sapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1)))
which at least returns a simple vector.
Additionally, the subsetting operator [] is not vectorized in the way this operation needs. The mutate() function does not iterate over rows; it operates on entire columns at a time. When you do
cf[cf$year == year, ]$CPIAUCSL
there is not just one year value; mutate() tries to handle them all at once.
You'd be better off doing a proper merge with your data and then doing the mutate. This will basically do the same thing as the pseudo-merge you were attempting in your version.
You can do
tmp %>% left_join(cf) %>%
mutate(infl.price = price / CPIAUCSL) %>%
select(-CPIAUCSL)
to get
year price infl.price
1 1991 12.03 12.03000
2 1991 12.98 12.98000
3 1992 14.05 13.63527
4 1992 14.58 14.14962
I am running into a sticky spot trying to solve for the variance accounted for by trend several times within a single data set.
My data is structured like this:
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750 " , header = T)
and I am trying to return something like this
STA R-sq
a n1
b n2
c n3
where n# is the corresponding r-squared value for that location's data in the original set.
I have tried
fit <- lm(VALUE ~ YEAR + STA, data = x)
to model the yearly trend of VALUE for each individual station over the years for which VALUE is available, within the master data set.
Any help would be greatly appreciated. I am really stumped on this one and I know it is just a familiarity-with-R problem.
To get the r-squared of VALUE ~ YEAR for each group of STA, you can take this previous answer, modify it slightly and plug in your values:
# assuming x is your data frame (make sure you don't have Hmisc loaded, it will interfere)
library(plyr)
models_x <- dlply(x, "STA", function(df)
  summary(lm(VALUE ~ YEAR, data = df)))
# extract the r.squared values
rsqds <- ldply(1:length(models_x), function(x) models_x[[x]]$r.squared)
# give names to rows and col
rownames(rsqds) <- unique(x$STA)
colnames(rsqds) <- "rsq"
# have a look
rsqds
rsq
a 0.6286064
b 0.5450413
c 0.8806604
EDIT: following mnel's suggestion here are more efficient ways to get the r-squared values into a nice table (no need to add row and col names):
# starting with models_x from above
rsqds <- data.frame(rsq =sapply(models_x, '[[', 'r.squared'))
# starting with just the original data in x, this is great:
rsqds <- ddply(x, "STA", summarize, rsq = summary(lm(VALUE ~ YEAR))$r.squared)
STA rsq
1 a 0.6286064
2 b 0.5450413
3 c 0.8806604
# first load the data.table package
library(data.table)
# transform your data frame to a data.table (I'm using your example)
x <- as.data.table(x)
# calculate all the metrics needed (r-squared, F statistic and so on)
x[, list(r2 = summary(lm(VALUE ~ YEAR))$r.squared,
         f  = summary(lm(VALUE ~ YEAR))$fstatistic[1]), by = STA]
STA r2 f
1: a 0.6286064 8.462807
2: b 0.5450413 5.990009
3: c 0.8806604 36.897258
There's only one r-squared value, not three; please edit your question.
# store the output
y <- summary( lm( VALUE ~ YEAR + STA , data = x ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
# or are you looking to run three separate
# lm() functions on 'a', 'b' and 'c'... where this would be the first?
y <- summary( lm( VALUE ~ YEAR , data = x[ x$STA %in% 'a' , ] ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
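And if the goal is all three stations at once in the same base-R style, a minimal sketch looping over the STA values:

# r-squared of VALUE ~ YEAR for each station, named by STA
rsq_by_sta <- sapply(unique(as.character(x$STA)), function(s)
  summary(lm(VALUE ~ YEAR, data = x[x$STA %in% s, ]))$r.squared)
rsq_by_sta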