R: data.table using a for loop to wrangle multiple columns

I am currently working in R to build a for loop which will add the year to 7 columns that contain partial dates (dd/mm). I have been attempting to run the following for-loop and have not been successful. What am I doing wrong?
Here's a sample of what my data set looks like (The actual data set includes columns HomDate - HomDate_7 but I only included the first few as I know you'll get the point...)
Participant DateVisit HomDate HomDate_2 HomeDate_3 year_flag
1 2012-04-25 18/04 19/04 20/04 NA
2 2012-01-04 28/12 29/12 30/12 1
3 2012-01-05 31/12 01/01 01/02 1
4 2012-06-13 06/06 07/06 08/06 NA
5 2012-02-12 05/02 06/02 07/02 NA
Here's the code I've been trying to use:
hom_date <- list("HomDate", "HomDate_2", "HomDate_3", "HomDate_4", "HomDate_5", "HomDate_6",
"HomDate_7")
set_dates <- function(x){
home_morbid[,x:=as.character(x)]
home_morbid[(substr(x, 4, 5)==12) & (year_flag==1), x:=paste(x, "/2011", sep="")]
home_morbid[(substr(x, 4, 5)==01) & (year_flag==1), x:=paste(x, "/2012", sep="")]
home_morbid[is.na(year_flag), x:=paste(x, "/", substr(DateVisit, 1, 4), sep="")]
}
for(i in 1:length(hom_date)){
x <- hom_date[i]
home_morbid_2<-set_dates(x)
}

I'm not sure what happens to those with an NA flag. Here is an approach:
to_replace<-grep("^Hom",names(df))
df[,(to_replace):=lapply(.SD, function(x) ifelse(is.na(year_flag),x,
ifelse(substr(x, 4, 5)==12,
paste0(x,"/","2011"),
paste0(x,"/","2012")))),
.SDcols=HomDate:HomeDate_3][]
Participant DateVisit HomDate HomDate_2 HomeDate_3 year_flag
1: 1 2012-04-25 18/04 19/04 20/04 NA
2: 2 2012-01-04 28/12/2011 29/12/2011 30/12/2011 1
3: 3 2012-01-05 31/12/2011 01/01/2012 01/02/2012 1
4: 4 2012-06-13 06/06 07/06 08/06 NA
5: 5 2012-02-12 05/02 06/02 07/02 NA
To replace NA flagged years with the year from DateVisit:
library(lubridate)
to_replace<-grep("^Hom",names(df))
df[,(to_replace):=lapply(.SD, function(x) ifelse(is.na(year_flag),
paste0(x,"/",year(ymd(DateVisit))),
ifelse(substr(x, 4, 5)==12,
paste0(x,"/","2011"),
paste0(x,"/","2012")))),
.SDcols=HomDate:HomeDate_3][]
Participant DateVisit HomDate HomDate_2 HomeDate_3 year_flag
1: 1 2012-04-25 18/04/2012 19/04/2012 20/04/2012 NA
2: 2 2012-01-04 28/12/2011 29/12/2011 30/12/2011 1
3: 3 2012-01-05 31/12/2011 01/01/2012 01/02/2012 1
4: 4 2012-06-13 06/06/2012 07/06/2012 08/06/2012 NA
5: 5 2012-02-12 05/02/2012 06/02/2012 07/02/2012 NA


Reference the previous non-zero row, find the difference and divide by nrows

I must be asking the question terribly because I can't find what I'm looking for!
I have a large Excel file that looks like this for every day of the month:
Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 35
etc NA
I want to make a new column that has the difference between the non-zero rows and divide that by the number of rows between each non zero row. Aiming for something like this:
Date Well1 Adjusted
1/1/16 10 =(20-10)/4 = 2.5
1/2/16 NA 1.25
1/3/16 NA 1.25
1/4/16 NA 1.25
1/5/16 20 =(25-20)/2= 2.5
1/6/16 NA 2.5
1/7/16 25 =(35-25)/3 = 3.3
1/8/16 NA 3.3
1/9/16 NA 3.3
1/10/16 35 etc
etc NA etc
I'm thinking I should use lead or lag, but the number of steps between each pair of non-zero rows differs (so I'm not sure how to set n in the lead/lag function). I've used group_by so that each month stands alone, and I've also attempted case_when and ifelse. Mostly I need ideas on translating the Excel format into a workable R format.
With some diff-ing and repeating of values, you should be able to get there.
dat$Date <- as.Date(dat$Date, format="%m/%d/%y")
nas <- is.na(dat$Well1)
dat$adj <- with(dat[!nas,],
diff(Well1) / as.numeric(diff(Date), units="days")
)[cumsum(!nas)]
# Date Well1 adj
#1 2016-01-01 10 2.5
#2 2016-01-02 NA 2.5
#3 2016-01-03 NA 2.5
#4 2016-01-04 NA 2.5
#5 2016-01-05 20 2.5
#6 2016-01-06 NA 2.5
#7 2016-01-07 25 5.0
#8 2016-01-08 NA 5.0
#9 2016-01-09 NA 5.0
#10 2016-01-10 40 NA
dat being used is:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
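The indexing step `[cumsum(!nas)]` is the part that is easy to miss, so here it is unpacked on a bare vector. This is only an illustrative sketch: it uses row positions instead of dates (which assumes one row per day), and the names `well`, `obs`, `grp` and `slope` are made up for the example.

```r
# Same Well1 values as in `dat`
well <- c(10, NA, NA, NA, 20, NA, 25, NA, NA, 40)

obs <- !is.na(well)   # TRUE where a reading exists
grp <- cumsum(obs)    # run number: 1 1 1 1 2 2 3 3 3 4

# One slope per pair of consecutive readings:
# difference in value divided by the number of rows between them
slope <- diff(well[obs]) / diff(which(obs))

# Repeat each slope over its run; the last run has no following reading,
# so indexing past the end of `slope` yields NA
adj <- slope[grp]
adj
```

Because there are four readings but only three differences, the final reading picks up `NA`, which matches the last row of the output above.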
Base R in the same vein as @thelatemail but with transformations all in one expression:
nas <- is.na(dat$Well1)
res <- within(dat, {
Date <- as.Date(Date, "%m/%d/%y")
Adjusted <- (diff(Well1[!nas]) /
as.numeric(diff(Date[!nas]), units = "days"))[cumsum(!nas)]
}
)
Data:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
Maybe this should work
library(dplyr)
df1 %>%
#// remove the rows with NA
na.omit %>%
# // create a new column with the lead values of Well1
transmute(Date, Well2 = lead(Well1)) %>%
# // join with original data
right_join(df1 %>%
mutate(rn = row_number())) %>%
# // order by the original order
arrange(rn) %>%
# // create a grouping column based on the NA values
group_by(grp = cumsum(!is.na(Well1))) %>%
# // subtract the first element of Well2 with Well1 and divide
# // by number of rows - n() in the group
mutate(Adjusted = (first(Well2) - first(Well1))/n()) %>%
ungroup %>%
select(-grp, - Well2)

How to create data.frame with different number of rows RData

I have a file in RData format: https://stepik.org/media/attachments/course/724/all_data.Rdata This file contains 7 lists with the id and temperature of patients.
I need to make one data.frame from these lists and then remove all rows with NA:
id temp i.temp i.temp.1 i.temp.2 i.temp.3 i.temp.4 i.temp.5
1: 1 36.70378 36.73161 36.22944 36.05907 35.66014 37.32798 35.88121
2: 2 36.43545 35.96814 36.86782 37.20890 36.45172 36.82727 36.83450
3: 3 36.87599 36.38842 36.70508 37.44710 36.73362 37.09359 35.92993
4: 4 36.17120 35.95853 36.33405 36.45134 37.17186 36.87482 35.45489
5: 5 37.20341 37.04881 36.53252 36.22922 36.78106 36.89219 37.13207
6: 6 36.12201 36.53433 37.29784 35.96451 36.70838 36.58684 36.60122
7: 7 36.92314 36.16220 36.48154 37.05324 36.57829 36.24955 37.23835
8: 8 35.71390 37.26879 37.01673 36.65364 36.89143 36.46331 37.15398
9: 9 36.63558 37.03452 36.40129 37.53705 36.03568 36.78083 36.71873
10: 10 36.77329 36.07161 36.42992 36.20715 36.78880 36.79875 36.15004
11: 11 36.66199 36.74958 36.28661 36.72539 36.17700 37.47495 35.60980
12: 12 NA 36.97689 36.00473 36.64292 35.96789 36.73904 36.93957
13: 13 NA NA NA NA NA 36.63760 36.83916
14: 14 37.40307 35.89668 36.30619 36.64382 37.21882 35.87420 35.45550
15: 15 NA NA NA 37.03758 36.72512 36.45281 37.54388
16: 16 NA 36.44912 36.57126 36.20703 36.83076 36.48287 35.99391
17: 17 NA NA NA 36.39900 36.54043 36.75989 36.47079
18: 18 36.51696 37.09903 37.31166 36.51000 36.42414 36.87976 36.45736
19: 19 37.05117 37.42526 36.15820 36.11824 37.07024 36.60699 36.80168
20: 20 NA NA NA NA NA NA 36.74118
I wrote:
load(url("https://stepik.org/media/attachments/course/724/all_data.Rdata"))
library(data.table)
day1<-as.data.table(all_data[1])
day2<-as.data.table(all_data[2])
day3<-as.data.table(all_data[3])
day4<-as.data.table(all_data[4])
day5<-as.data.table(all_data[5])
day6<-as.data.table(all_data[6])
day7<-as.data.table(all_data[7])
setkey(day1, id)
setkey(day2, id)
setkey(day3, id)
setkey(day4, id)
setkey(day5, id)
setkey(day6, id)
setkey(day7, id)
all_day<-day1[day2,][day3, ][day4,][day5,][day6,][day7,]
all_day<-na.omit(all_day)
But it takes too long. How can I make it faster?
Here is a data.table solution:
library( data.table )
#set names for all_data
names( all_data ) <- paste0( "day", 1:length(all_data))
#bind lists to data.table
DT <- data.table::rbindlist( all_data, use.names = TRUE, fill = TRUE, idcol = "day" )
#cast to wide
ans <- dcast( DT, id ~ day, value.var = "temp" )
#only keep complete rows and present output (using [] at the end)
ans[ complete.cases( ans ), ][]
# id day1 day2 day3 day4 day5 day6 day7
# 1: 1 36.70378 36.73161 36.22944 36.05907 35.66014 37.32798 35.88121
# 2: 2 36.43545 35.96814 36.86782 37.20890 36.45172 36.82727 36.83450
# 3: 3 36.87599 36.38842 36.70508 37.44710 36.73362 37.09359 35.92993
# 4: 4 36.17120 35.95853 36.33405 36.45134 37.17186 36.87482 35.45489
# 5: 5 37.20341 37.04881 36.53252 36.22922 36.78106 36.89219 37.13207
# 6: 6 36.12201 36.53433 37.29784 35.96451 36.70838 36.58684 36.60122
# 7: 7 36.92314 36.16220 36.48154 37.05324 36.57829 36.24955 37.23835
# 8: 8 35.71390 37.26879 37.01673 36.65364 36.89143 36.46331 37.15398
# 9: 9 36.63558 37.03452 36.40129 37.53705 36.03568 36.78083 36.71873
# 10: 10 36.77329 36.07161 36.42992 36.20715 36.78880 36.79875 36.15004
# 11: 11 36.66199 36.74958 36.28661 36.72539 36.17700 37.47495 35.60980
# 12: 14 37.40307 35.89668 36.30619 36.64382 37.21882 35.87420 35.45550
# 13: 18 36.51696 37.09903 37.31166 36.51000 36.42414 36.87976 36.45736
# 14: 19 37.05117 37.42526 36.15820 36.11824 37.07024 36.60699 36.80168
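For comparison, the chain of joins from the question can also be written compactly in base R with `Reduce()` and `merge()`. This is a different technique from the `rbindlist()`/`dcast()` answer and is not claimed to be faster; it is a sketch on made-up data, and it assumes each list element has exactly the columns `id` and `temp`, in that order.

```r
# Toy stand-in for all_data (assumed structure: one data.frame per day)
all_data <- list(
  data.frame(id = 1:4,         temp = c(36.5, 36.8, 37.0, 36.4)),
  data.frame(id = c(1L, 3L, 4L), temp = c(36.6, 36.9, 36.2)),       # id 2 is missing
  data.frame(id = 1:4,         temp = c(36.7, 36.1, NA, 36.3))      # id 3 has an NA
)

# Give each day's temperature column a distinct name before merging
days <- Map(function(d, k) setNames(d, c("id", paste0("day", k))),
            all_data, seq_along(all_data))

# Inner-join all days on id, then drop rows that still contain NA
wide <- Reduce(function(a, b) merge(a, b, by = "id"), days)
wide <- na.omit(wide)
wide
```

The inner join removes patients that are absent from any day, and `na.omit()` removes those that are present but have a missing temperature.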

Efficient way for insertion of multiple rows at given indices & with repetitions

I have a data frame (DATA) with > 2 million rows (observations at different time points) and another data frame (INSERTION) which gives info about missing observations. The latter object contains 2 columns: 1st column with row indices after which empty (NA) rows should be inserted into DATA, and 2nd column with the number of empty rows that should be inserted at that position.
Below is a minimum working example:
DATA <- data.frame(datetime=strptime(as.character(c(201301011700, 201301011701, 201301011703, 201301011704, 201301011705, 201301011708, 201301011710, 201301011711, 201301011715, 201301011716, 201301011718, 201301011719, 201301011721, 201301011722, 201301011723, 201301011724, 201301011725, 201301011726, 201301011727, 201301011729, 201301011730, 201301011731, 201301011732, 201301011733, 201301011734, 201301011735, 201301011736, 201301011737, 201301011738, 201301011739)), format="%Y%m%d%H%M"), var1=rnorm(30), var2=rnorm(30), var3=rnorm(30))
INSERTION <- data.frame(index=c(2, 5, 6, 8, 10, 12, 19), repetition=c(1, 2, 1, 3, 1, 1, 1))
Now I'm looking for an efficient (and thus fast) way to insert the n empty rows at given row indices of the original file. How can I additionally complement the correct datetimes for these empty rows (add 1 minute for every new row; however, every weekend and bank holidays there are some regular gaps which are not contained in INSERTION!)?
Any help is appreciated!
Looking at the pattern in INSERTION and matching it with DATA, you are most probably trying to fill in the missing minutes in the datetime column of DATA. You can create a data frame with every minute from the min to the max value of datetime in DATA and then merge it with DATA:
merge(data.frame(datetime = seq(min(DATA$datetime), max(DATA$datetime),
by = "1 min")),DATA, all.x = TRUE)
# datetime var1 var2 var3
#1 2013-01-01 17:00:00 -1.063326 0.11925 -0.788622
#2 2013-01-01 17:01:00 1.263185 0.24369 -0.502199
#3 2013-01-01 17:02:00 NA NA NA
#4 2013-01-01 17:03:00 -0.349650 1.23248 1.496061
#5 2013-01-01 17:04:00 -0.865513 -0.51606 -1.137304
#6 2013-01-01 17:05:00 -0.236280 -0.99251 -0.179052
#7 2013-01-01 17:06:00 NA NA NA
#8 2013-01-01 17:07:00 NA NA NA
#9 2013-01-01 17:08:00 -0.197176 1.67570 1.902362
#10 2013-01-01 17:09:00 NA NA NA
#...
#...
Or using similar logic with tidyr::complete
tidyr::complete(DATA, datetime = seq(min(datetime), max(datetime), by = "1 min"))
If performance is a factor on a large data frame, this approach avoids joins:
# Generate new data.frame containing missing datetimes
tmp <- data.frame(datetime = DATA$datetime[with(INSERTION, rep(index, repetition))] + sequence(INSERTION$repetition)*60)
# Create variables filled with NA to match main data.frame
tmp[setdiff(names(DATA), names(tmp))] <- NA
# Bind and sort
new_df <- rbind(DATA, tmp)
new_df <- new_df[order(new_df$datetime),]
head(new_df, 15)
datetime var1 var2 var3
1 2013-01-01 17:00:00 0.98789253 0.68364933 0.70526985
2 2013-01-01 17:01:00 -0.68307496 0.02947599 0.90731512
31 2013-01-01 17:02:00 NA NA NA
3 2013-01-01 17:03:00 -0.60189915 -1.00153188 0.06165694
4 2013-01-01 17:04:00 -0.87329313 -1.81532302 -2.04930719
5 2013-01-01 17:05:00 -0.58713154 -0.42313098 0.37402224
32 2013-01-01 17:06:00 NA NA NA
33 2013-01-01 17:07:00 NA NA NA
6 2013-01-01 17:08:00 2.41350911 -0.13691754 1.57618578
34 2013-01-01 17:09:00 NA NA NA
7 2013-01-01 17:10:00 -0.38961552 0.83838954 1.18283382
8 2013-01-01 17:11:00 0.02290672 -2.10825367 0.87441448
35 2013-01-01 17:12:00 NA NA NA
36 2013-01-01 17:13:00 NA NA NA
37 2013-01-01 17:14:00 NA NA NA
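The `rep()`/`sequence()` combination in the join-free approach does the real work, so a small sketch of just that step may help. The shortened `times` vector and the two-row `index`/`repetition` inputs are invented for illustration; the pattern is the same as in the answer.

```r
# Six observed minutes with gaps after the 2nd and 5th observation
times <- as.POSIXct("2013-01-01 17:00", tz = "UTC") + 60 * c(0, 1, 3, 4, 5, 8)

index      <- c(2, 5)  # insert after these rows
repetition <- c(1, 2)  # number of rows to insert at each position

rep(index, repetition)  # each index repeated once per inserted row: 2 5 5
sequence(repetition)    # offset within each gap, in minutes: 1 1 2

# Start from the row before each gap and add 1, 2, ... minutes
new_times <- times[rep(index, repetition)] + sequence(repetition) * 60
format(new_times, "%H:%M")
```

Note that this only fills the gaps listed in INSERTION, so the regular weekend and bank-holiday gaps stay untouched, whereas a full minute sequence from min to max would fill those as well.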

Compute the variance of a moving window in a dataframe

Hey, I want to compute the variance of a column. My dataframe is sorted by date (as.Date() format). Here you can see a snippet of it:
Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA
The dataframe ranges from January 2004 up to December 2018. But I do not want to compute the variance of the whole columns.
I want to compute the variance over one year (i.e. 12 values), moving month by month.
I do not really know how to start. I can imagine using the zoo package and rollapply. But here the problem is (I think) that R uses the values around each point and not the ones before it?
I also found this question: R: create a data frame out of a rolling window, so my idea was to get rid of the date column. It is easy to build the matrix, but now I do not understand how to apply the variance function to my data...
Is there a smart way to compute it all in one and also using the information of the date? If not, I also appreciate any other solution from you!
We can use rollapplyr to perform the rolling computations; the trailing "r" means the window is right-aligned, so each value is computed from the current row and the rows before it. Since there are only 11 rows in the data in the question we can't take 12-month variances, but we can illustrate the idea with 3-month windows instead. Remove fill = NA if you want to omit the NA rows, or replace it with partial = TRUE if you want variances computed from fewer values than the window width near the beginning. If you want a data frame result use fortify.zoo(zv).
library(zoo)
z <- read.zoo(DF)
zv <- rollapplyr(z, 3, var, fill = NA)
zv
giving this zoo object:
USA ARG BRA CHL COL MEX PER
2012-04-01 NA NA NA NA NA NA NA
2012-05-01 NA NA NA NA NA NA NA
2012-06-01 0 1.287083e-04 4.998008e-04 1.126781e-09 1.237524e-11 5.208793e-06 NA
2012-07-01 0 1.033001e-04 5.217420e-05 9.109406e-10 3.883996e-12 3.565057e-06 NA
2012-08-01 0 9.358558e-06 1.396497e-05 2.060928e-09 4.221043e-12 4.600220e-06 NA
2012-09-01 0 1.113297e-05 3.108380e-08 9.159058e-10 4.826929e-12 7.453672e-07 NA
2012-10-01 0 1.988357e-06 4.498977e-08 2.485889e-10 2.953403e-12 8.001948e-07 NA
2012-11-01 0 3.560373e-06 1.944961e-05 2.615387e-10 1.168389e-11 2.971477e-07 NA
2012-12-01 0 3.717777e-05 2.655440e-05 1.271886e-10 1.814869e-11 4.312436e-07 NA
2013-01-01 0 2.042867e-05 3.268476e-05 2.806455e-10 7.540331e-11 1.231438e-06 NA
2013-02-01 0 4.134729e-07 1.129013e-04 1.186146e-10 1.983651e-11 3.263780e-07 NA
We can plot the log of the variances like this:
library(ggplot2)
autoplot(log(zv), facet = NULL) + geom_point() + ylab("log(var(.))")
Note
We assume that the starting point is the data frame generated reproducibly below:
Lines <- "Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA"
DF <- read.table(text = Lines, header = TRUE)
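To make the window alignment concrete, here is a minimal base R sketch of a right-aligned rolling variance, written without zoo. The helper name `roll_var_right` is hypothetical and only meant to show which values enter each window; for real work `rollapplyr` is the more convenient choice.

```r
# Right-aligned rolling variance: position i uses x[(i - width + 1):i],
# i.e. the current value and the (width - 1) values before it
roll_var_right <- function(x, width) {
  n   <- length(x)
  out <- rep(NA_real_, n)          # incomplete windows at the start stay NA
  if (n >= width) {
    for (i in width:n) {
      out[i] <- var(x[(i - width + 1):i])
    }
  }
  out
}

# First four ARG values from the data above
arg <- c(0.2271531, 0.2218906, 0.2054076, 0.2033470)
res <- roll_var_right(arg, 3)
res
```

The first two entries are `NA` because no complete 3-value window exists yet, and the third entry agrees with the ARG value shown for 2012-06-01 in the zoo output. With the full 2004 to 2018 data, a width of 12 would give the one-year moving variance.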

Condition for function and loop

I have a data frame simplified as follow:
head(dendro)
X DateTime ID diameter dendro ring DOY month mday year Rain_mm_Tot Through_Tot temp
1 1 2012-06-21 13:45:00 r1_1 5482 1 1 173 6 22 113 NA NA NA
2 2 2012-06-21 13:45:00 r2_3 NA 3 2 173 6 22 113 NA NA NA
3 3 2012-06-21 13:45:00 r1_2 5534 2 1 173 6 22 113 NA NA NA
4 4 2012-06-21 13:45:00 r2_4 NA 4 2 173 6 22 113 NA NA NA
5 5 2012-06-21 13:45:00 r1_3 5606 3 1 173 6 22 113 NA NA NA
6 6 2012-06-21 13:45:00 r2_5 NA 5 2 173 6 22 113 NA NA NA
The dataframe is first split by "ID", so it becomes a list with one element per ID.
After that I apply a function that includes a loop, and the result is a new column "diameter2" holding the result I want from the function. That works OK:
dendro_sp <- split(dendro, dendro$ID)
library(changepoint)
dendro_sp <- lapply(dendro_sp, function(x){
x <- subset(x, !is.na(diameter))
cpfit <- cpt.mean(x$diameter, method="BinSeg")
x$diameter2 <- x$diameter
cpts <- cpfit@cpts
means <- param.est(cpfit)$mean
meanZero <- means[1]
for(i in 1:(length(cpts)-1)){
x$diameter2[(cpts[i]+1):cpts[i+1]] <- x$diameter2[(cpts[i]+1):cpts[i+1]] + (meanZero - means[i+1])
}
return(x)
})
dendro2 <- do.call(rbind, dendro_sp)
rownames(dendro2) <- NULL
My problem is that I want to apply it conditionally, for example only to r1_1 and r1_3, and for the rest of the IDs simply copy the "diameter" value into the new column "diameter2" instead of applying the function:
ifelse(diameter$ID==c("r1_1","r1_3"), apply_the_function_to_r11_and_r13_to_calculate_diameter2, otherwise_write_diameter_value_in_diameter2_column)
Remember that the dataframe "dendro" is split by ID; I don't know if that matters when defining the condition for several IDs.
Thanks
I am not sure I understand the problem correctly, but I will try to answer.
I assume you want to apply a function to the "diameter" field of the "diameter" data.frame, conditioning on the "ID" field, and return the result in the corresponding diameter2 field. I don't know how the function works, so forgive me if this does not work.
Selected fields
diameter$diameter2[diameter$ID=="r1_1"|diameter$ID=="r1_3"]<- yourfun(diameter$diameter[diameter$ID=="r1_1"|diameter$ID=="r1_3"])
Unselected fields
diameter$diameter2[diameter$ID!="r1_1" & diameter$ID!="r1_3"]<- diameter$diameter[diameter$ID!="r1_1" & diameter$ID!="r1_3"]
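Since the data in the question is already split by ID, the condition can also be checked once per list element inside `lapply`. The sketch below uses a small invented `dendro` table and a hypothetical stand-in `adjust_fun` in place of the changepoint-based correction, so only the branching pattern should be taken from it.

```r
# Toy data (assumed): three IDs with three diameter readings each
dendro <- data.frame(
  ID       = rep(c("r1_1", "r1_2", "r1_3"), each = 3),
  diameter = c(10, 11, 12, 20, 21, 22, 30, 31, 32)
)

adjust_ids <- c("r1_1", "r1_3")      # IDs that should be adjusted
adjust_fun <- function(d) d - d[1]   # hypothetical stand-in for the real correction

dendro_sp <- split(dendro, dendro$ID)
dendro_sp <- lapply(dendro_sp, function(x) {
  # Every row in x shares one ID, so testing the first is enough
  x$diameter2 <- if (x$ID[1] %in% adjust_ids) adjust_fun(x$diameter) else x$diameter
  x
})

dendro2 <- do.call(rbind, dendro_sp)
rownames(dendro2) <- NULL
dendro2
```

Using `%in%` rather than `==` matters here: comparing against a vector of two IDs with `==` recycles the vector element by element instead of testing membership.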
