How can I streamline this R script?

I have a big data frame in R that all looks about like this:
name amount date1 date2 days_out year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971
If someone's days_out is less than 60, they get a 90% discount; from 60 to 90, a 70% discount; and so on down the tiers in the code below. I need to find the discounted sum of all the amounts for each year. My utterly embarrassing workaround is to write a Python script that writes an R script that reads like this for each relevant year:
tmp <- members[members$year==1971, ]
tmp90 <- tmp[tmp$days_out <= 60 & tmp$days_out > 0 & !is.na(tmp$days_out), ]
tmp70 <- tmp[tmp$days_out <= 90 & tmp$days_out > 60 & !is.na(tmp$days_out), ]
tmp50 <- tmp[tmp$days_out <= 120 & tmp$days_out > 90 & !is.na(tmp$days_out), ]
tmp30 <- tmp[tmp$days_out <= 180 & tmp$days_out >120 & !is.na(tmp$days_out), ]
tmp00 <- tmp[tmp$days_out > 180 | is.na(tmp$days_out), ]
details.1971 <- c(1971, nrow(tmp),
                  nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9,
                  nrow(tmp70), sum(tmp70$amount), sum(tmp70$amount) * .7,
                  nrow(tmp50), sum(tmp50$amount), sum(tmp50$amount) * .5,
                  nrow(tmp30), sum(tmp30$amount), sum(tmp90$amount) * .9,
                  nrow(tmp00), sum(tmp00$amount))
membership.for.chart <- rbind(membership.for.chart,details.1971)
It works just fine. The tmp frames and vectors get overwritten which is fine. But I know that I've utterly defeated everything that is elegant and efficient about R here. I launched R for the first time a month ago and I think I've come a long way. But I would really like to know how I should have gone about this?
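For what it's worth, the per-year script generation can be replaced by an ordinary R loop over the years. A minimal sketch with a small stand-in data frame; only the 90% tier is shown, and the other tiers extend the same way:

```r
# Stand-in for the real data frame described in the question
members <- data.frame(year     = c(1971, 1971, 1972),
                      amount   = c(318.5, 1518.5, 518.5),
                      days_out = c(50, NA, 30))

membership.for.chart <- NULL
for (yr in unique(members$year)) {
  tmp   <- members[members$year == yr, ]
  tmp90 <- tmp[tmp$days_out <= 60 & tmp$days_out > 0 & !is.na(tmp$days_out), ]
  # ... build tmp70, tmp50, tmp30, tmp00 the same way as in the script above ...
  details <- c(yr, nrow(tmp), nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9)
  membership.for.chart <- rbind(membership.for.chart, details)
}
```

This keeps the original per-tier logic intact while letting R, not Python, do the iteration.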

Wow, you wrote a Python script that generates an R script? Consider my eyebrows raised...
Hopefully this will get you started:
#Import your data; add dummy column to separate 'days' suffix into its own column
dat <- read.table(text = " name amount date1 date2 days_out dummy year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971",header = TRUE,sep = "")
#Repeat 3 times
df <- rbind(dat,dat,dat)
#Create new year variable
df$year <- rep(1971:1973,each = nrow(dat))
#Breaks for discount levels
ct <- c(0,60,90,120,180,Inf)
#Cut into a factor
df$fac <- cut(df$days_out,ct)
#Create discount amounts for each row
df$discount <- c(0.9,0.7,0.5,0.9,1)[df$fac]
df$discount[is.na(df$discount)] <- 1
#Calc adj amount
df$amount_adj <- with(df,amount * discount)
#I use plyr a lot, but there are many, many
# alternatives
library(plyr)
ddply(df, .(year), summarise,
      amt = sum(amount_adj),
      total = length(year),
      d60 = length(which(fac == "(0,60]")))
I only calculated a few of your summary values in the last ddply command. I'm assuming you can extend it yourself.

You can use either the cut function or the findInterval function. The exact code will depend on the internals of the object, which are not unambiguously communicated with console output. If that days_out is a difftime object, then something like this might work:
disc_amt <- with(tmp, amount * c(.9, .7, .5, .9, 1)[
  findInterval(days_out, c(0, 60, 90, 120, 180, Inf))])
You should post the output of dput() on that tmp object, or perhaps dput(head(tmp, 20)) if it's really big, so testing can proceed. (The actual discounts did not seem to be ordered in a manner I would have expected.)
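Putting the pieces together, a one-pass base-R version of the whole calculation might look like the sketch below. It assumes days_out is numeric and uses the monotone multipliers (.9/.7/.5/.3) that the tier names tmp90/tmp70/tmp50/tmp30 suggest, rather than the repeated .9 in the question:

```r
# Stand-in for the real data
members <- data.frame(year     = c(1971, 1971, 1971, 1972),
                      amount   = c(318.5, 1518.5, 518.5, 1018.5),
                      days_out = c(50, NA, 92, 150))

mult <- c(.9, .7, .5, .3, 1)  # (0,60], (60,90], (90,120], (120,180], everything else
# left.open = TRUE makes the intervals (0,60], (60,90], ... as in the question
idx <- findInterval(members$days_out, c(0, 60, 90, 120, 180), left.open = TRUE)
idx[is.na(idx) | idx == 0] <- 5          # NA or non-positive days_out: no discount

members$adj <- members$amount * mult[idx]
aggregate(adj ~ year, data = members, FUN = sum)
```

aggregate then gives the discounted sum per year in one call, with no per-year temporary frames.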

Related

How to make this simple function more efficient?

I have data on wages; about 95% of them are given in hourly format, but some are given as an annual salary. So I made a function to convert the annual salaries to hourly, but it takes 1 min 40 sec to run on a dataset of 43,000 rows x 12 columns, which I didn't think was big enough to take this long.
I am curious whether there is a better way to do this than the function I have created. I am new to dplyr and the tidyverse, so ideally I'd like an answer using those capabilities.
Here is some sample data:
NOC4 Region Region_Name Wage_2012 Wage_2013 Wage_2014
0011 ER10 National 28.1 65000 NA
0011 ER1010 Northern NA 30.5 18
0011 ER1020 Southern 42.3 72000 22
0011 ER1030 Eastern 12 NA 45500
0011 ER1040 Western 8 NA 99000
0011 ER10 National NA 65000 NA
Here is what it should look like after the function:
NOC4 Region Region_Name Wage_2012 Wage_2013 Wage_2014
0011 ER10 National 28.1 33.33 NA
0011 ER1010 Northern NA 30.5 18
0011 ER1020 Southern 42.3 36.92 22
0011 ER1030 Eastern 12 NA 23.33
0011 ER1040 Western 8 NA 50.77
0011 ER10 National NA 33.33 NA
Here is the function:
year_to_hour <- function(dataset, salary, startcol){
  # "startcol" should be the first column containing the numeric
  # values that you are trying to convert.
  for(i in startcol:ncol(dataset)){
    for(j in 1:nrow(dataset)){
      if(is.na(dataset[j, i])){
        next
      } else if(as.numeric(dataset[j, i]) >= as.numeric(salary)){
        dataset[j, i] = dataset[j, i] / 1950
      } else {
        dataset[j, i] = dataset[j, i]
      }
    }
  }
  return(as_tibble(dataset))
}
converted <- year_to_hour(wage_data_messy, 1000, 4)
R will work much faster if you let it handle the loops under the hood through "vectorized" code.
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
Here's an approach using dplyr:
library(dplyr)
salary <- 1000
df %>%
  mutate_at(vars(Wage_2012:Wage_2014),          # For these columns...
            ~ . / if_else(. > salary, 1950, 1)) # Divide by 1950 if > salary
Using dplyr, I would use mutate_if:
salary <- 1000
df %>% mutate_if(is.numeric, ~ifelse(. > salary, ./1950, .))
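For comparison, the same vectorized idea works in base R without dplyr: build a logical mask over the wage columns and divide only the entries above the salary threshold. A sketch using a small stand-in for the data (the 1950 hours-per-year divisor comes from the question):

```r
df <- data.frame(Wage_2012 = c(28.1, NA, 42.3),
                 Wage_2013 = c(65000, 30.5, 72000),
                 Wage_2014 = c(NA, 18, 22))
salary <- 1000

wage_cols <- grep("^Wage_", names(df))   # columns to convert
m <- df[wage_cols]
over <- !is.na(m) & m > salary           # TRUE where the value looks like an annual salary
m[over] <- m[over] / 1950                # convert those entries to hourly
df[wage_cols] <- m
```

Because the comparison and the division each touch every cell at once, this replaces both nested loops with two vectorized operations.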

How to find correlation in a data set

I wish to find the correlation of trip duration and age from the data set below. I am applying the function cor(age, df$tripduration). However, it is giving me NA as the output. Could you please let me know how to work out the correlation? I found the "age" with the following syntax:
age <- (2017-as.numeric(df$birth.year))
and tripduration(seconds) as df$tripduration.
Below is the data. The number 1 in gender means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
[1] 0.08366848
# To check the coefficient against birth year directly
cor(df$tripduration, df$birth.year)
[1] -0.08366848
By the way, please format the question with easily reproducible data that people can just copy and paste into R. This actually helps you get an answer.
Based on the OP's comment, here is a new suggestion. Try deleting the rows with NA before performing a correlation test.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
[1] 0.1726607
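Alternatively, cor() can handle the missing values itself through its use argument, so the rows don't have to be deleted first. A sketch on a small stand-in for the data:

```r
df <- data.frame(tripduration = c(439, 186, 442, NA),
                 birth.year   = c(1980, 1984, 1969, 1986))
age <- 2017 - as.numeric(df$birth.year)

# "complete.obs" keeps only the pairs where both values are present
r <- cor(df$tripduration, age, use = "complete.obs")
```

This gives the same result as running cor() on the manually filtered complete cases, without modifying the data frame.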

R- combine rows of a data frame to be unique by 3 columns

I have a data frame looking like this:
> head(temp)
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 <NA> 36.4 <NA>
24339 2008253059 695 696 132 <NA> <NA>
72450 2008953178 527 528 <NA> 38.6 <NA>
72957 2008953178 527 528 123 <NA> <NA>
73976 2008965669 527 528 <NA> 36.2 <NA>
74504 2008965669 527 528 116 <NA> <NA>
The first and second rows are both for the same patient (same VisitIDCode); in the first row I have the value of heart rate and in the second I have the value of temperature, from time 2 to 3. I want to combine these rows so that the result is one row that looks like:
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 132 36.4 <NA>
In other words, I want my data frame to be unique by combination of VisitIDCode, start and stop. This is a large dataframe with more columns that need to be combined.
What is the best way of doing it and if at all possible, avoiding for loop?
Edit: I don't want to remove the NAs. If there are 2 rows each of which have one value and 2 NAs, I want to combine them to one row so it has two values and one NA. Like the example above.
nasim,
It's useful to create a reproducible example when posting questions. It makes it much easier to sort out how to help. I created a toy example here. Hopefully, that reproduces your issue:
> df <- data.frame(MRN = c(123,125,213,214),
+ VID = c(2008,2008,2011,2011),
+ start=c(695,695),
+ heart.rate = c(NA,112,NA,96),
+ temp = c(39.6,NA,37.4,NA))
> df
MRN VID start heart.rate temp
1 123 2008 695 NA 39.6
2 125 2008 695 112 NA
3 213 2011 695 NA 37.4
4 214 2011 695 96 NA
Here is a solution using dplyr:
> library(dplyr)
> df <- df %>%
+ group_by(VID) %>%
+ summarise(MRN = max(MRN,na.rm=T),
+ start=max(start,na.rm=T),
+ heart.rate=max(heart.rate,na.rm=T),
+ temp = max(temp,na.rm=T))
> df
# A tibble: 2 × 5
VID MRN start heart.rate temp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2008 125 695 112 39.6
2 2011 214 695 96 37.4
After I made sure all column classes were numeric (not factors) by defining the classes of the columns while reading the data in, this worked for me:
CompleteCoxObs <- aggregate(
  x = CompleteCoxObs[c("stop", "Value_EVS current weight kg CAL",
                       "Value_EVS hr heart rate NU EE0A",
                       "Value_EVS temp celsius CAL 113C")],
  by = list(VisitIDCode = CompleteCoxObs$VisitIDCode, start = CompleteCoxObs$start),
  max, na.rm = FALSE)
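A dplyr variant of the same collapse keeps, per column, the first non-NA value within each group. This is a sketch assuming dplyr >= 1.0 (for across()) and at most one non-NA value per group and column, as in the toy example above:

```r
library(dplyr)

df <- data.frame(VID        = c(2008, 2008, 2011, 2011),
                 start      = c(695, 695, 695, 695),
                 heart.rate = c(NA, 112, NA, 96),
                 temp       = c(39.6, NA, 37.4, NA))

out <- df %>%
  group_by(VID, start) %>%                                    # the "unique by" key
  summarise(across(everything(), ~ first(na.omit(.x))),       # first non-NA per column
            .groups = "drop")
```

Unlike max(..., na.rm = TRUE), this never warns on all-NA groups; it simply leaves NA, which matches the edit in the question.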

I want to use R to sample my timestamped dataframe

I want to use R to sample my dataframe. My data is timestamped epidemiological data, and I want to randomly sample at least 1 and as many as 10 records for each year, preferably in a manner that is scaled to the number of records for each year. I would like to export the results as a csv.
Here are a few lines of my dataset, where I've left off the long genetic sequence field for each record.
year matrix USD clade
1958 W mG018U UP
1958 W mG018U UP
1958 W mG018U UP
1966 UN mG140L LL
1969 UN mG207L LL
1969 UN mG013L LL
1971 UN mG208L LL
1972 HA mG129M MN
1973 C1 mG018U UP
1973 NA mG001U UC
1973 NA mG001U UC
All I've learned to do is
sample(mydata, size = 600, replace = FALSE)
which of course doesn't take the year into account.
There are many possibilities to run sample per group (for example sample_n in the dplyr package), here's an illustration using the data.table package.
You can take a fraction, say 0.1, of the records in each year so the sample size scales with the group, wrap it in ceiling in case that fraction comes out smaller than 1, and cap it at a maximum of 10 per group with min. For example:
library(data.table)
setDT(df)[, .SD[sample(.N, min(10, ceiling(.N*.1)))], year]
# year matrix USD clade
#1: 1958 W mG018U UP
#2: 1966 UN mG140L LL
#3: 1969 UN mG013L LL
#4: 1971 UN mG208L LL
#5: 1972 HA mG129M MN
#6: 1973 NA mG001U UC
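The dplyr equivalent, plus the CSV export the question asked for, could look like this sketch (assumes dplyr >= 1.0 for slice_sample(); the data frame is a small stand-in):

```r
library(dplyr)

df <- data.frame(year  = c(1958, 1958, 1958, 1966, 1969, 1969),
                 clade = c("UP", "UP", "UP", "LL", "LL", "LL"))

sampled <- df %>%
  group_by(year) %>%
  # at least 1 and at most 10 rows per year, scaled to 10% of the group size
  group_modify(~ slice_sample(.x, n = min(10, ceiling(nrow(.x) * 0.1)))) %>%
  ungroup()

write.csv(sampled, "sampled.csv", row.names = FALSE)
```

group_modify() is used here instead of slice_sample() directly because the per-group n varies with the group size.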

Creating lag variables for matched factors

I have a question about creating lag variables depending on a time factor.
Basically I am working with a baseball dataset with many names for each player between 2002-2012. Obviously I only want lag variables for the same person, to try to create a career arc to predict the current stat. For example, I want to use lag 1 Average (2003) and lag 2 Average (2004) to try to predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is the n-1 row), checks if the name is the same, and if so grabs the value from the previous row.
Here is my loop:
for(i in 2:6264){ # start at 2, since row 1 has no previous row
  if(TS$name[i] == TS$name[i-1]){
    TS$runvalueL1[i] = TS$Run_Value[i-1]
  } else {
    TS$runvalueL1[i] = NA
  }
}
Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!
Sample Data won't help a bunch but here is some:
Edit: the sample data wasn't producing usable results, so I just attached the first 10 people of my dataset. Thanks!
TS[(6:10),c('name','Season','Run_Value')]
name Season ARuns
321 Abad Andy 2003 -1.05
3158 Abercrombie Reggie 2006 27.42
1312 Abercrombie Reggie 2007 7.65
1069 Abercrombie Reggie 2008 5.34
4614 Abernathy Brent 2002 46.71
707 Abernathy Brent 2003 -2.29
1297 Abernathy Brent 2005 5.59
6024 Abreu Bobby 2002 102.89
6087 Abreu Bobby 2003 113.23
6177 Abreu Bobby 2004 128.60
Thank you!
Something along these lines should do it:
names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)
library(data.table)
dt = data.table(names, years, Run_value)
dt[, lag1 := c(NA, head(Run_value, -1)), by = names]
# names years Run_value lag1
#1: Adams 2002 10 NA
#2: Adams 2003 15 10
#3: Adams 2004 15 15
#4: Adams 2005 20 15
#5: Bobby 2004 10 NA
#6: Bobby 2005 5 10
#7: Charlie 2010 5 NA
An alternative would be to split the data by name, use lapply with the lag function of your choice, and then combine the split data again:
TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))
or
TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))
There is probably also a nice possibility with plyr, but as you did not provide a reproducible example, that is all for now.
Better:
TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
This is obviously not a problem where you want to create a matrix with cbind, so this is a better data structure:
full=data.frame(names, years, Run_value)
The ave function is quite useful for constructing new columns within categories of other columns:
full$Lag1 <- ave(full$Run_value, full$names,
                 FUN = function(x) c(NA, x[-length(x)]))
full
names years Run_value Lag1
1 Adams 2002 10 NA
2 Adams 2003 15 10
3 Adams 2004 15 15
4 Adams 2005 20 15
5 Bobby 2004 10 NA
6 Bobby 2005 5 10
7 Charlie 2010 5 NA
I think it's safer to construct with NA, since that will surface errors in logic that using 0 for prior years in year 1 would not alert you to.
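For completeness, the same grouped lag is a one-liner in dplyr, whose lag() (not stats::lag()) shifts within each group and fills the first row with NA:

```r
library(dplyr)

TS <- data.frame(name      = c("Adams", "Adams", "Adams", "Bobby", "Bobby"),
                 Season    = c(2002, 2003, 2004, 2004, 2005),
                 Run_Value = c(10, 15, 20, 10, 5))

TS <- TS %>%
  group_by(name) %>%
  mutate(runvalueL1 = lag(Run_Value)) %>%  # previous season's value, NA for a player's first
  ungroup()
```

As with the other solutions, this assumes the data is already sorted by name and then year.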
