R Populate a vector by matching names to df column values

I have a named vector filled with zeros:
toy1 <- rep(0, length(37:45))
names(toy1) <- 37:45
I want to populate the vector with count data from a data frame:
size count
37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024
I need a way to match each size value to the corresponding vector name and write the matching count into that position.

Might be as simple as:
toy1[ as.character(dat$size) ] <- dat$count
toy1
# 37 38 39 40 41 42 43 44 45
#1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
Assignment indexing in R accepts character values, which are matched against the vector's names. If you had instead indexed with the raw numeric column:
toy1[ dat$size ] <- dat$count
You would have gotten (as did I initially):
> toy1
   37    38    39    40    41    42    43    44    45
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
   NA    NA   ...    NA                    # 27 unnamed NAs filling positions 10-36
1.181 0.421 0.054 0.005 0.031 0.582    NA    NA 0.024   # positions 37-45
That happened because the sizes were treated as numeric indices, and R silently extended the length of the vector to accommodate positions up to 45.
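A minimal sketch of the two indexing modes (toy names, nothing taken from the question):
v <- c(a = 1, b = 2)
v[5] <- 9        # numeric index: v silently grows to length 5, positions 3-4 become NA
v["b"] <- 9      # character index: matched against names(v), no extension
length(v)
# [1] 5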
With a version of the data frame containing a size outside the range 37:45, using match with nomatch = 0 produced a warning, but still gave the expected result:
toy1[ match( as.character( dat$size), names(toy1) , nomatch=0) ] <- dat$count
#------------
Warning message:
In toy1[match(as.character(dat$size), names(toy1), nomatch = 0)] <- dat$count :
number of items to replace is not a multiple of replacement length
> toy1
37 38 39 40 41 42 43 44 45
1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.000
The match function is at the core of merge, but this direct application will be much faster than a merge of data frames.
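For reference, a quick look at what match returns with the example data:
match(as.character(dat$size), names(toy1))
# [1] 1 2 3 4 5 6 9    # the position of each size among the names of toy1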

Let's say your data frame is df; then you can update just the entries of toy1 whose names appear in your data frame:
toy1[as.character(df$size)] <- df$count
Edit: to check for matches before updating, first compute m, which holds, for each name of toy1, the position of that name in the size column of df (or NA if absent):
m <- match(names(toy1), as.character(df$size))
Then the positions of toy1 that have a match can be updated:
toy1[which(!is.na(m))] <- df$count[m[!is.na(m)]]
PS: an efficient alternative is to hold toy1 as a data frame and perform an outer join on the size column.
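A short sketch of that join idea, using the same df as above:
toy_df <- data.frame(size = 37:45)
joined <- merge(toy_df, df, by = "size", all.x = TRUE)  # left outer join keeps every size
joined$count[is.na(joined$count)] <- 0                  # unmatched sizes get 0 instead of NA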

First, let's get the data loaded in.
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
df = read.table(text="37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024")
names(df) = c("size","count")
Now, a really ugly solution. We update toy1 only where its name appears in df$size, and we pull the matching df$count values by looking up the index of each match in df; sapply returns those indices as a vector. On both sides we restrict attention to the positions where names(toy1) appear in df$size.
in_df <- names(toy1) %in% df$size
toy1[in_df] <- df$count[sapply(names(toy1)[in_df], function(x) which(x == df$size))]
But, this isn't very elegant. Instead, you could turn toy1 into a data.frame.
toydf = data.frame(toy1 = toy1, name = names(toy1), stringsAsFactors = FALSE)
Now, we can use merge to get the values.
updated = merge(toydf, df, by.x = "name", by.y = "size", all.x = TRUE)
This returns a three-column data.frame; extract the count column, replace NA with 0, and you're done.
updated$count[is.na(updated$count)] = 0
updated$count
#> [1] 1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024

Related

How to group by and fill NA with closest not NA in R dataframe column with condition on another column

I have a data frame of blood test marker results, and I want to fill in the NAs by the following criterion:
within each ID group (TIME is in ascending order), if a marker value is NA, fill it with the closest non-NA value in that group (past or future), but only if the time difference is less than 14 days.
Here is an example of my data:
df <- data.frame(ID = c(rep(2,5), rep(4,3)), TIME = c(1,22,33,43,85,-48,1,30),
                 CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
                 CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
                 CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
                 CA.19.9 = c(6.18,7.11,5.72,NA,7.38,39.30,118.20,98.26),
                 CA.72.4 = c(rep(NA,5),1.32,NA,NA),
                 NSE = c(NA,13.21,rep(NA,6)))
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA NA
2 43 2.33 22.34 NA NA NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
ID is the patient.
The TIME is the time of the blood test.
The others are the markers.
The only way I could do it is with loops, which I try to avoid as much as possible.
I expect the output to be:
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA 13.21
2 43 2.33 22.34 30.31 5.72 NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
CA.19.9 and CA.125 are filled with the previous value (10 days before).
NSE is filled with the previous value (11 days before).
CA.72.4 is not filled, since its only value (1.32, at TIME -48) is 49 days from the next measurement.
I bet there is a much simpler, vectorized solution but the following works.
fill_NA <- function(DF){
  sp <- split(DF, DF$ID)                     # handle each ID separately
  sp <- lapply(sp, function(DF){
    d <- diff(DF$TIME)
    i_diff <- c(FALSE, d < 14)               # TRUE where the previous row is < 14 days away
    res <- sapply(DF[-(1:2)], function(X){
      inx <- i_diff & is.na(X)
      if(any(inx)){
        inx <- which(inx)
        last_change <- -1
        for(i in inx){
          if(i > last_change + 1){           # avoid propagating a value we just imputed
            if(i == 1){
              X[i] <- X[i + 1]
            } else {
              X[i] <- X[i - 1]
            }
            last_change <- i
          }
        }
      }
      X
    })
    cbind(DF[1:2], res)
  })
  res <- do.call(rbind, sp)
  row.names(res) <- NULL
  res
}
fill_NA(df)
fill_NA(df)
# ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
#1 2 1 1.32 14.62 33.98 6.18 NA NA
#2 2 22 1.42 14.59 27.56 7.11 NA 13.21
#3 2 33 1.81 16.80 30.31 5.72 NA 13.21
#4 2 43 2.33 22.34 30.31 5.72 NA NA
#5 2 85 2.23 36.33 39.57 7.38 NA NA
#6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA
#7 4 1 23.34 94.09 956.50 118.20 NA NA
#8 4 30 18.23 121.50 825.30 98.26 NA NA
Yes, you can have a vectorized solution. First, consider the case where you impute only from the future value. You need to create a few auxiliary variables:
a variable that tells you whether the next observation belongs to the same ID (so it can be used to impute),
a variable that tells you whether the next observation is less than 14 days apart from the current one.
These do not depend on the specific variable you want to impute. For each variable to be imputed you will also need a variable that tells you whether the next value is missing.
Then you can vectorize the following logic: when the next observation has the same ID, is less than 14 days from the current one, and is not missing, copy its value into the current one.
Things get more complicated when you need to decide whether to use the past or the future value, but the logic is the same. The code is below; it is a bit long, but you can simplify it. I just wanted to be clear about what it does.
Hope this helps.
x <- data.frame(ID = c(rep(2,5), rep(4,3)), TIME = c(1,22,33,43,85,-48,1,30),
                CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
                CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
                CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
                CA.19.9 = c(6.18,7.11,5.72,NA,7.38,39.30,118.20,98.26),
                CA.72.4 = c(rep(NA,5),1.32,NA,NA),
                NSE = c(NA,13.21,rep(NA,6)))
### these are the columns we want to input
cols.to.impute <- colnames(x)[! colnames(x) %in% c("ID","TIME")]
### is the next id the same?
x$diffidf <- NA
x$diffidf[1:(nrow(x)-1)] <- diff(x$ID)
x$diffidf[x$diffidf > 0] <- NA
### is the previous id the same?
x$diffidb <- NA
x$diffidb[2:nrow(x)] <- diff(x$ID)
x$diffidb[x$diffidb > 0] <- NA
### diff in time with next observation
x$difftimef <- NA
x$difftimef[1:(nrow(x)-1)] <- diff(x$TIME)
### diff in time with previous observation
x$difftimeb <- NA
x$difftimeb[2:nrow(x)] <- diff(x$TIME)
### if next (previous) id is not the same time difference is not meaningful
x$difftimef[is.na(x$diffidf)] <- NA
x$difftimeb[is.na(x$diffidb)] <- NA
### we do not need diffid anymore (due to previous statement)
x$diffidf <- x$diffidb <- NULL
### if next (previous) point in time is more than 14 days it is not useful for imputation
x$difftimef[abs(x$difftimef) > 14] <- NA
x$difftimeb[abs(x$difftimeb) > 14] <- NA
### create variable usef that tells us whether we should attempt to use the forward observation for imputation
### it is 1 only if difftime forward is less than difftime backward
x$usef <- NA
x$usef[!is.na(x$difftimef) & x$difftimef < x$difftimeb] <- 1
x$usef[!is.na(x$difftimef) & is.na(x$difftimeb)] <- 1
x$usef[is.na(x$difftimef) & !is.na(x$difftimeb)] <- 0
if (!is.na(x$usef[nrow(x)]))
  stop("\nlast observation usef is not missing\n")
### now we get into column specific operations.
for (col in cols.to.impute){
  ### we will store the results in x$imputed, and copy into x[,col] at the end
  x$imputed <- x[,col]
  ### x$usef needs to be modified depending on the specific column, so we define a local version of it
  x$usef.local <- x$usef
  ### if a variable is not missing there is no point in looking at usef.local, so we make it missing
  x$usef.local[!is.na(x[,col])] <- NA
  ### when usef.local is 1 but the next observation is missing it cannot be used for imputation, so we
  ### make it 0. but a value of 0 does not mean we can use the previous observation because that may
  ### be missing too. so first we make usef 0 and next we check the previous observation and if that
  ### is missing too we make usef missing
  x$previous.value <- c(NA, x[1:(nrow(x)-1), col])
  x$next.value <- c(x[2:nrow(x), col], NA)
  x$next.missing <- is.na(x$next.value)
  x$previous.missing <- is.na(x$previous.value)
  x$usef.local[x$next.missing & x$usef.local == 1] <- 0
  x$usef.local[x$previous.missing & x$usef.local == 0] <- NA
  ### now we can impute properly: use next value when usef.local is 1 and previous value when usef.local is 0
  tmp <- rep(FALSE, nrow(x))
  tmp[x$usef.local == 1] <- TRUE
  x$imputed[tmp] <- x$next.value[tmp]
  tmp <- rep(FALSE, nrow(x))
  tmp[x$usef.local == 0] <- TRUE
  x$imputed[tmp] <- x$previous.value[tmp]
  ### copy to column
  x[,col] <- x$imputed
}
### get rid of useless temporary stuff
x$previous.value <- x$previous.missing <- x$next.value <- x$next.missing <- x$imputed <- x$usef.local <- NULL
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE difftimef difftimeb usef
1 2 1 1.32 14.62 33.98 6.18 NA NA NA NA NA
2 2 22 1.42 14.59 27.56 7.11 NA 13.21 11 NA 1
3 2 33 1.81 16.80 30.31 5.72 NA 13.21 10 11 1
4 2 43 2.33 22.34 30.31 5.72 NA NA NA 10 0
5 2 85 2.23 36.33 39.57 7.38 NA NA NA NA NA
6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA NA NA NA
7 4 1 23.34 94.09 956.50 118.20 NA NA NA NA NA
8 4 30 18.23 121.50 825.30 98.26 NA NA NA NA NA

Using regex to set a specific digit to NA? [duplicate]

Sample of df:
LASSO_deviance LASSO_AUC
68 0.999 0.999
2 1.000 1.000
39 1.000 1.005
7 1.02 1.2
I want to set cells which contain 1.000 to either NA or 0, in that order of preference.
I've tried something like: df %>% mutate_at(vars(LASSO_deviance, LASSO_AUC), funs(gsub(pattern = "1{1}[^.{1,}]", 0, x = .))) with no luck.
tt <- "LASSO_deviance LASSO_AUC
68 0.999 0.999
2 1.000 1.000
39 1.000 1.005
7 1.02 1.2"
dat <- read.table(text = tt, header = T)
There is no need for regex, because you can simply find where the data equals 1.000:
dat[dat == 1.000] <- NA # or dat[dat == 1.000] <- 0
dat
# LASSO_deviance LASSO_AUC
# 68 0.999 0.999
# 2 NA NA
# 39 NA 1.005
# 7 1.020 1.200
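One caveat worth flagging: == on doubles can miss values that merely print as 1.000 after arithmetic. If that is a risk, a tolerance-based comparison is a safer sketch (tol is an arbitrary choice here):
tol <- 1e-8
dat[!is.na(dat) & abs(dat - 1) < tol] <- NA   # guard against NAs before subassigning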

Two columns from a single column values in r

I have data in a single column and want to convert it into two columns:
beta
2
.002
52
.06
61
0.09
70
0.12
85
0.92
I want into two col as:
col1 col2
2 0.002
52 0.06
61 0.09
70 0.12
85 0.92
Can anyone please help me sort this out?
We can unlist the data frame and convert it into a matrix with nrow(df)/2 rows:
data.frame(matrix(unlist(df), nrow = nrow(df)/2, byrow = T))
# X1 X2
#1 2 0.002
#2 52 0.060
#3 61 0.090
#4 70 0.120
#5 85 0.920
Alternatively, we can use a recycled logical index to create the two columns:
i1 <- c(TRUE, FALSE)
df2 <- data.frame(col1 = df1$beta[i1], col2 = df1$beta[!i1])
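The c(TRUE, FALSE) index works by recycling: R repeats it along the length of df1$beta, so i1 selects the odd positions and !i1 the even ones. With the sample data:
df1 <- data.frame(beta = c(2, .002, 52, .06, 61, 0.09, 70, 0.12, 85, 0.92))
i1 <- c(TRUE, FALSE)
df1$beta[i1]   # 2 52 61 70 85
df1$beta[!i1]  # 0.002 0.060 0.090 0.120 0.920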

Sum of columns of dataframe based on time period in R

I have a data frame with multiple columns and rows, containing monthly observations over a period of 11 years. I want the sum of each column based on the observations for the previous 12 months. For example, the sum for Jan-05 is based on the observations from Jan-04 to Dec-04; the sum for Feb-05 is based on Feb-04 to Jan-05, and so on. My original data frame holds ten years of monthly data.
I illustrate part of my dataframe as follows:
df1
Month A B C
Jan-04 0.003 0.006 NA
Feb-04 0.003 0.002 NA
Mar-04 -0.005 -0.001 NA
Apr-04 0.000 0.000 NA
May-04 0.000 -0.002 NA
Jun-04 -0.001 -0.001 NA
Jul-04 -0.001 -0.001 NA
Aug-04 -0.010 NA NA
Sep-04 0.001 NA NA
Oct-04 0.002 NA NA
Nov-04 -0.003 NA NA
Dec-04 -0.003 NA NA
Jan-05 0.005 -0.002 NA
Feb-05 -0.0015 0.004 0.0003
Mar-05 -0.0041 0.002 0.0070
The desired resultant dataframe
Month A B C
Jan-05 -0.013 0.004 NA
Feb-05 -0.011 -0.004 NA
Mar-05 -0.0151 -0.0014 0.0003
Here is a solution in base R. First we define a function to subset the df based on the time difference from the date of interest and find the column sums on that subsetted df, and then we run that function for all of the time points of interest.
subset_last_year <- function(df, date, cols_to_sum = c("A", "B", "C")){
  date = as.POSIXct(date, format = "%d-%b-%y")
  df$Time_Difference = difftime(date, df$Month_Date, units = "weeks")
  df_last_year = df[df$Time_Difference > 0 & df$Time_Difference < 53, ]
  tmp_col_sum = colSums(df_last_year[ , cols_to_sum], na.rm = TRUE)
  return(tmp_col_sum)
}
# oddly, you have to add days
df$Month_Date = paste0("01-", df$Month)
df$Month_Date = as.POSIXct(df$Month_Date, format = "%d-%b-%y")
# not worried about performance because the data set is not that large
dates = c("01-Jan-05", "01-Feb-05", "01-Mar-05")
res = data.frame()
for(i in 1:length(dates)){
  tmp = subset_last_year(df, dates[i])
  res = rbind(res, tmp)
}
rownames(res) = dates
colnames(res) = c("A", "B", "C")
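If the rows are guaranteed to be consecutive calendar months in ascending order (an assumption the question does not state), a rolling-window sketch with the zoo package avoids the date arithmetic entirely:
library(zoo)
prev12 <- function(x) {
  roll <- rollapplyr(x, 12, sum, na.rm = TRUE)     # sums of windows ending at rows 12..n
  c(rep(NA, 12), roll[seq_len(length(x) - 12)])    # shift so row t sums rows (t-12)..(t-1)
}
res2 <- data.frame(Month = df$Month, lapply(df[c("A", "B", "C")], prev12))
Note that na.rm = TRUE turns an all-NA window into 0 rather than NA, just as the colSums call above does.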

Dividing specific values between two arrays

I am a novice R user working with a data set of 40,000 rows and 300 columns. I have found a solution for what I would like to do, but my machine takes over an hour to run the code, and I suspect an expert could offer a quicker approach (I can do this in Excel in half the time). I will post my solution at the end.
What I would like to do is the following:
Compute the average value for each column NY1 to NYn based on the value of the YYYYMMbucket column.
Divide each original value by its YYYYMMbucket average.
Here is sample of my original data set:
YYYYMMbucket NY1 NY2 NY3 NY4
1 200701.3 0.309 NA 20.719 16260
2 200701.3 0.265 NA 19.482 15138
3 200701.3 0.239 NA 19.168 14418
4 200701.3 0.225 NA 19.106 14046
5 200701.3 0.223 NA 19.211 14040
6 200701.3 0.234 NA 19.621 14718
7 200701.3 0.270 NA 20.522 15780
8 200701.3 0.298 NA 22.284 16662
9 200701.2 0.330 NA 23.420 16914
10 200701.2 0.354 NA 23.805 17310
11 200701.2 0.388 NA 24.095 17448
12 200701.2 0.367 NA 23.954 17640
13 200701.2 0.355 NA 23.255 17748
14 200701.2 0.346 NA 22.731 17544
15 200701.2 0.347 NA 22.445 17472
16 200701.2 0.366 NA 21.945 17634
17 200701.2 0.408 NA 22.683 18876
18 200701.2 0.478 NA 23.189 21498
19 200701.2 0.550 NA 23.785 22284
20 200701.2 0.601 NA 24.515 22368
This is what my averages look like:
YYYYMMbucket NY1M NY2M
1 200701.1 0.4424574 NA
2 200701.2 0.4530000 NA
3 200701.3 0.2936935 NA
4 200702.1 0.4624063 NA
5 200702.2 0.4785937 NA
6 200702.3 0.3091161 NA
7 200703.1 0.4159687 NA
8 200703.2 0.4491875 NA
9 200703.3 0.2840081 NA
10 200704.1 0.4279137 NA
How I would like my final output to look:
NY1avgs NY2avgs NY3avgs
1 1.052117 NA 0.7560868
2 0.9023011 NA 0.7109456
3 0.8137734 NA 0.699487
4 0.7661047 NA 0.6972245
5 0.7592949 NA 0.7010562
6 0.7967489 NA 0.7160181
7 0.9193256 NA 0.7488978
8 1.014663 NA 0.8131974
9 0.7284768 NA 0.857904
Here's how I did it:
First I used "plyr" to compute my averages, simple enough:
test <- ddply(prf.delete2b, .(YYYYMMbucket), summarise,
              NY1M = mean(NY1), NY2M = mean(NY2), ... )
Then used a series of the following:
x <- c(1:40893)
lookv <- function(x,ltab,rcol=2) ltab[max(which(ltab[,1]<=x)),rcol]
NY1Fun <- function(x) (prf.delete2b$NY1[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,2))
NY2Fun <- function(x) (prf.delete2b$NY2[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,3))
NY1Avgs <- lapply(x, NY1Fun)
NY2Avgs <- lapply(x, NY2Fun)
I also tried a variant of the above by saying:
NY1Fun <- function(x) (prf.delete2b$NY1[x] / subset(test, YYYYMMbucket == prf.delete2b$YYYYMMbucket[x], select =c(NY1M)))
lapply(x, NY1Fun)
Each variant of NYnFun takes a good 20 seconds to run so doing this 300 times takes much too long. Can anyone recommend any alternative to what I posted or point out any novice mistakes I've made?
Here is the customary data.table approach, which works pretty fast.
# CREATE DUMMY DATA
N = 1000
mydf = data.frame(
  bucket = sample(letters, N, replace = T),
  NY1 = runif(N),
  NY2 = runif(N),
  NY3 = runif(N),
  NY4 = runif(N)
)
# SCALE COLUMNS BY AVG
library(data.table)
scale_x = function(x) x/ave(x)
mydt = data.table(mydf)
ans = mydt[,lapply(.SD, scale_x), by = 'bucket']
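One caveat: ave(x) calls mean without na.rm, so a column containing NAs (like NY2 in the original data) comes back all NA. A variant that ignores NAs when forming each group mean (scale_x_na is just an illustrative name):
scale_x_na = function(x) x / mean(x, na.rm = TRUE)   # .SD is already the group subset, so this is the group mean
ans = mydt[, lapply(.SD, scale_x_na), by = 'bucket']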
How about:
test2 <- merge(prf.delete2b, test, all.x = TRUE)
test2[2:ncol(prf.delete2b)] / test2[(ncol(prf.delete2b) + 1):ncol(test2)]
In this case, I would use ave instead of ddply because ave returns a vector the same length as its input. ave only accepts a vector, so you need to use lapply to loop over the columns of your data.frame.
myFun <- function(x, groupVar) {
  x / ave(x, groupVar, FUN = function(y) mean(y, na.rm = TRUE))
}
relToMeans <- data.frame(prf.delete2b[1],
                         lapply(prf.delete2b[-1], myFun, groupVar = prf.delete2b[1]))
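Since ave's length-preserving behavior is the crux here, a one-line illustration with toy values:
ave(c(1, 2, 3, 10), c("a", "a", "a", "b"))
# [1]  2  2  2 10    # each element replaced by its group mean, same length as the input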
