Suppose we start with this dataframe, and R-code that generates it immediately below:
> data
ID Period Values Flags
1 1 1 5 X0
2 1 2 10 X1
3 1 3 15 X2
4 1 4 20 X3
5 2 1 0 X0
6 2 2 2 X2
7 2 3 4 XO
8 2 4 6 X1
9 3 1 3 XO
10 3 2 6 XO
11 3 3 9 X2
12 3 4 12 XO
data <-
data.frame(
ID = c(1,1,1,1,2,2,2,2,3,3,3,3),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values = c(5, 10, 15, 20, 0, 2, 4, 6, 3, 6, 9, 12),
Flags = c("X0","X1","X2","X3","X0","X2","XO", "X1", "XO","XO","X2","XO")
)
I am trying to write code that shows the migration of the number of IDs (and of Values, by number of IDs) from one "Flag" category to the next, based on two periods input by the user. For example, if the user inputs period 1 as the "from" period and period 4 as the "to" period, we'd get the migration tables shown in the image at the bottom. I also include the tables for from/to periods 2/3 on the right side of the image, for the sake of illustration.
I've typically done this sort of analysis in Excel, a cumbersome multi-step process, and am now trying it out in R.
Any suggestions for coding this?
Here are functions to create the two required tables.
Note that you had a problem with X0 and XO in your mock data set as r2evans already suggested (I've transformed everything to X0).
data <-
data.frame(
ID = c(1,1,1,1,2,2,2,2,3,3,3,3),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values = c(5, 10, 15, 20, 0, 2, 4, 6, 3, 6, 9, 12),
Flags = c("X0","X1","X2","X3","X0","X2","X0", "X1", "X0","X0","X2","X0")
)
# Empty Flags-by-Flags table: columns are the "from" flag, rows are the "to" flag
generateTable <- function(data){
  df <- data.frame(matrix(NA, ncol=length(unique(data$Flags)), nrow=length(unique(data$Flags))))
  row.names(df) <- unique(data$Flags)
  names(df) <- unique(data$Flags)
  return(df)
}
# Count the IDs that move from each flag (column) to each flag (row)
numbers2migrate <- function(data, from=1, to=4){
  df <- generateTable(data)
  for (i in unique(data$ID)){
    id_from <- as.character(data$Flags[data$ID == i & data$Period == from])
    id_to <- as.character(data$Flags[data$ID == i & data$Period == to])
    column <- which(names(df) == id_from)
    row <- which(row.names(df) == id_to)
    df[row, column] <- ifelse(is.na(df[row, column]), 1, df[row, column] + 1)
  }
  return(df)
}
# Sum the "from"-period Values of the IDs in each from/to cell
values2migrate <- function(data, from=1, to=4){
  df <- generateTable(data)
  for (i in unique(data$ID)){
    id_from <- as.character(data$Flags[data$ID == i & data$Period == from])
    id_to <- as.character(data$Flags[data$ID == i & data$Period == to])
    column <- which(names(df) == id_from)
    row <- which(row.names(df) == id_to)
    val <- data$Values[data$ID == i & data$Period == from]
    # Accumulate rather than overwrite, so IDs sharing a from/to pair are summed
    df[row, column] <- ifelse(is.na(df[row, column]), val, df[row, column] + val)
  }
  return(df)
}
> numbers2migrate(data, from=1, to=4)
X0 X1 X2 X3
X0 1 NA NA NA
X1 1 NA NA NA
X2 NA NA NA NA
X3 1 NA NA NA
> values2migrate(data,1,4)
X0 X1 X2 X3
X0 3 NA NA NA
X1 0 NA NA NA
X2 NA NA NA NA
X3 5 NA NA NA
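For larger data sets, the loop over IDs can be replaced by a join plus a cross-tabulation. The following is only a sketch in base R: migration_tables() is a made-up helper name, it assumes each ID has exactly one row per period, and it reports 0 instead of NA for empty cells. As in the tables above, columns are the "from" flag and rows are the "to" flag.

```r
# Mock data from the question, with the "XO" typos already corrected to "X0".
data <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  Values = c(5, 10, 15, 20, 0, 2, 4, 6, 3, 6, 9, 12),
  Flags  = c("X0", "X1", "X2", "X3", "X0", "X2", "X0", "X1", "X0", "X0", "X2", "X0")
)

# Hypothetical helper (not from the answer above); assumes one row per ID and period.
migration_tables <- function(data, from = 1, to = 4) {
  lev   <- sort(unique(as.character(data$Flags)))
  start <- data[data$Period == from, c("ID", "Flags", "Values")]
  end   <- data[data$Period == to,   c("ID", "Flags")]
  # One row per ID, carrying its flag in both periods and its "from" value.
  m <- merge(start, end, by = "ID", suffixes = c(".from", ".to"))
  # Fixed factor levels keep flags that do not occur in these two periods.
  m$Flags.from <- factor(m$Flags.from, levels = lev)
  m$Flags.to   <- factor(m$Flags.to,   levels = lev)
  list(
    counts = table(to = m$Flags.to, from = m$Flags.from),
    values = xtabs(Values ~ Flags.to + Flags.from, data = m)
  )
}

res <- migration_tables(data, from = 1, to = 4)
res$counts
res$values
```

Here the values table sums the "from"-period Values per cell; if the "to"-period values are wanted instead, the Values column would have to be taken from the second subset.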
For each element in a vector, I want the corresponding next smaller value in the vector, without changing the original order of the elements.
For example, suppose the given vector is:
c(4, 5, 5, 10, 3, 7)
Then the result would be:
c(3, 4, 4, 7, 0, 5)
Note that since 3 does not have any smaller value, I want it to be replaced with 0.
Any help will be much appreciated. Thank you.
We may use
sapply(v1, function(x) sort(v1)[match(x, sort(v1))-1][1])
[1] 3 4 4 7 NA 5
Or use a vectorized option
v2 <- unique(v1)
v3 <- sort(v2)
v4 <- v3[-length(v3)]
i1 <- match(v1, v3) - 1
i1[i1 == 0] <- NA
v4[i1]
[1] 3 4 4 7 NA 5
data
v1 <- c(4, 5, 5, 10, 3, 7)
We can try the code below using outer + max.col, where v is the input vector:
> m <- outer(v, u <- sort(unique(v)), `>`)
> replace(u[max.col(m, ties.method = "last")], rowSums(m) == 0, NA)
[1] 3 4 4 7 NA 5
Using findInterval, where x is the input vector:
sx = sort(x)
i = findInterval(x, sx, left.open = TRUE)
sx[replace(i, i == 0, NA)]
# [1] 3 4 4 7 NA 5
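The answers above return NA for the smallest element, whereas the question asked for 0. A small wrapper around the findInterval approach could handle that. This is a sketch only: next_smaller() and its fill argument are made-up names, and the input is assumed to contain no NA values.

```r
# Hypothetical wrapper; assumes x is numeric and contains no NA values.
next_smaller <- function(x, fill = 0) {
  sx <- sort(unique(x))
  # With left.open = TRUE, i is the number of distinct values strictly below x.
  i <- findInterval(x, sx, left.open = TRUE)
  out <- sx[replace(i, i == 0, NA)]
  # Elements with nothing smaller get the fill value instead of NA.
  out[is.na(out)] <- fill
  out
}

next_smaller(c(4, 5, 5, 10, 3, 7))
# [1] 3 4 4 7 0 5
```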
I have 12 variables that contain NA values. I need to convert the NAs to a separate level, and the level value differs between variables. Here is my code:
Replace_NAs <- function(colindex, na_level){
  cname <- colnames(tr[colindex])
  tr <- tr %>% mutate(cname = as.factor(replace(cname, is.na(cname), na_level)))
  return(tr)
}
for (i in 1:12) {
  if(i == 5){
    na_level <- 3;
    tr <- Replace_NAs(i,na_level);
  }
  else if(i == 11){
    na_level <- 5;
    tr <- Replace_NAs(i,na_level);
  }
  else if(i == 4|6|8){
    na_level <- 1;
    tr <- Replace_NAs(i,na_level);
  }
  else{
    na_level <- 20;
    tr <- Replace_NAs(i,na_level);
  }
}
Please help me. Thanks.
As Johan mentioned in the comments, you should include a reproducible example. Without that, we're left guessing at what exactly you want.
That said, here's my guess at what'll help you:
library(dplyr)
library(tidyr)

df %>%
  mutate_at(vars(5), ~ replace_na(., 3)) %>%
  mutate_at(vars(11), ~ replace_na(., 5)) %>%
  mutate_at(vars(4, 6, 8), ~ replace_na(., 1)) %>%
  mutate_at(vars(-c(4, 5, 6, 8, 11)), ~ replace_na(., 20))
Again, please provide a reproducible example with data and your desired output. A more robust answer to your question would explore applying a list of intended switches to your dataframe, but that would be overkill here.
Here's a way to do this using a for loop.
Consider this example :
tr <- data.frame(a = c(NA, 2, NA, 3), b = c(2, 3, NA, 4),
c = c(5, 6, NA, NA), d = c(1, 2, 3, NA))
tr
# a b c d
#1 NA 2 5 1
#2 2 3 6 2
#3 NA NA NA 3
#4 3 4 NA NA
Now prepare a list of column indices and a vector of replacement values
cols <- list(1, c(2, 3))
vals <- c(3, 5)
Use a for loop to replace the columns with the values
for(i in seq_along(cols)) {
  tr[cols[[i]]][is.na(tr[cols[[i]]])] <- vals[i]
}
For remaining columns
f_cols <- setdiff(seq_len(ncol(tr)), unlist(cols))
tr[f_cols][is.na(tr[f_cols])] <- 20
tr
# a b c d
#1 3 2 5 1
#2 2 3 6 2
#3 3 5 5 3
#4 3 4 5 20
Notice how the NAs in column 1 are replaced with 3, the NAs in columns 2 and 3 with 5, and the NAs in the remaining column with 20.
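The same idea can be wrapped in a function that takes the per-column replacements as a named vector and also converts each column to a factor, which is what the question was after. This is a sketch under assumptions: fill_na_levels(), fills and default are made-up names, and the columns are addressed by name rather than by index.

```r
# Hypothetical helper: `fills` is a named vector (column name -> replacement),
# `default` is used for every column not listed in `fills`.
fill_na_levels <- function(df, fills, default) {
  stopifnot(all(names(fills) %in% names(df)))
  per_col <- setNames(rep(default, ncol(df)), names(df))
  per_col[names(fills)] <- fills
  # Replace the NAs column by column, then turn each column into a factor.
  df[] <- Map(function(col, v) {
    col[is.na(col)] <- v
    factor(col)
  }, df, per_col)
  df
}

tr <- data.frame(a = c(NA, 2, NA, 3), b = c(2, 3, NA, 4),
                 c = c(5, 6, NA, NA), d = c(1, 2, 3, NA))
out <- fill_na_levels(tr, fills = c(a = 3, b = 5, c = 5), default = 20)
out
```

Because the replacement happens before the conversion, the fill value simply becomes one more factor level; if it must always be a distinct level, it should be a value that does not already occur in the column.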
I want to sample a number of contiguous rows from a data frame df.
df <- data.frame(C1 = c(1, 2, 4, 7, 9), C2 = c(2, 4, 6, 8, 10))
I am trying to get something similar to the following which allows me to sample 3 random rows and repeat the process 100 times.
test <- replicate(100, df[sample(1:nrow(df), 3, replace=T),], simplify=F)
By contiguous, I mean that each sampled result should look something like this:
[[1]]
C1 C2
2 2 4
3 4 6
4 7 8
[[2]]
C1 C2
1 1 2
2 2 4
3 4 6
.
.
.
How could I achieve this?
We just need to sample the starting row index for a chunk.
sample.block <- function (DF, chunk.size) {
  if (chunk.size > nrow(DF)) return(NULL)
  start <- sample.int(nrow(DF) - chunk.size + 1, 1)
  DF[start:(start + chunk.size - 1), ]
}
replicate(100, sample.block(df, 3), simplify = FALSE)
I would like to replace NAs in my data frame with values from another column. For example:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
df
> df
a1 b1 c1 a2 b2 c2
1 1 3 NA 2 1 3
2 2 NA 3 3 2 3
3 4 4 3 5 4 2
4 NA 4 4 5 5 3
5 2 4 2 3 6 4
6 NA 3 3 4 3 3
I would like to replace the NAs in df$a1 with the values from the corresponding row in df$a2, the NAs in df$b1 with the values from the corresponding row in df$b2, and the NAs in df$c1 with the values from the corresponding row in df$c2, so that the new data frame looks like:
> df
a1 b1 c1
1 1 3 3
2 2 2 3
3 4 4 3
4 5 4 4
5 2 4 2
6 4 3 3
How can I do this? I have a large data frame with many columns, so it would be great to find an efficient way to do this (I've already seen Replace missing values with a value from another column). Thank you!
An extensible option:
df2 <- df[c('a1','b1','c1')]
df2[] <- mapply(function(x,y) ifelse(is.na(x), y, x),
df[c('a1','b1','c1')], df[c('a2','b2','c2')],
SIMPLIFY=FALSE)
df2
# a1 b1 c1
# 1 1 3 3
# 2 2 2 3
# 3 4 4 3
# 4 5 4 4
# 5 2 4 2
# 6 4 3 3
It's easy enough to extend this to arbitrary column pairs: the first column in the first subset (df[c('a1','b1','c1')]) is paired with the first column of the second subset; second column first subset, second column second subset; etc. It can even be generalized with df[grepl('1$',colnames(df))] and df[grepl('2$',colnames(df))], assuming they don't mis-match.
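The suffix-based generalization mentioned above could look like the sketch below. It assumes every column ending in "1" has a partner ending in "2" and that both sets appear in the same order; the stopifnot() call checks that assumption before anything is filled. The names first, second and filled are only illustrative.

```r
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

# Select the two column sets by suffix.
first  <- df[grepl("1$", names(df))]
second <- df[grepl("2$", names(df))]

# Guard against mismatched pairs: the stems must line up one to one.
stopifnot(identical(sub("1$", "", names(first)), sub("2$", "", names(second))))

# Fill each NA in a "1" column from the matching "2" column.
filled <- first
filled[] <- Map(function(x, y) ifelse(is.na(x), y, x), first, second)
filled
```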
coalesce in dplyr is meant to do exactly this: it replaces the NAs in a first vector with the non-NA elements of a later one. For example:
coalesce(df$a1,df$a2)
[1] 1 2 4 5 2 4
It can be used with sapply to do the whole dataset in an efficient and easily extensible manner:
sapply(c("a","b","c"),function(x) coalesce(df[,paste0(x,1)],df[,paste0(x,2)]))
a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
dfnew<- ifelse(is.na(df$a1) == T, df$a2, df$a1)
as.data.frame(dfnew)
This handles only the a1 column; you'll have to run it for each of a, b and c and cbind the results. If there are too many columns, running a loop will be the best option, in my opinion.
You can use hutils::coalesce. It should be slightly faster, especially if it can 'cheat' -- if any columns have no NAs and so don't need to change, coalesce will skip them:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
s <- function(x) {
  sample(x, size = 1e6, replace = TRUE)
}
df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
                          a2 = s(a2), b2 = s(b2), c2 = s(c2)))
library(microbenchmark)
library(hutils)
library(data.table)
dt <- as.data.table(df)
old <- paste0(letters[1:3], "1") # you will need to specify
new <- paste0(letters[1:3], "2")
dplyr_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
  }
  ans
}
hutils_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
  }
  ans
}
microbenchmark(dplyr = dplyr_coalesce(df),
               hutils = hutils_coalesce(df))
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800 100 b
#> hutils 36.48602 46.76336 63.46643 52.95736 64.53066 252.5608 100 a
Created on 2018-03-29 by the reprex package (v0.2.0).
I would like to merge several matrices using their row names.
These matrices do not have the same number of rows and columns.
For instance:
m1 <- matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
rownames(m1) <- c("a","b","c")
m2 <- matrix(c(1, 2, 3, 5, 4, 5, 6, 2), 4, 2)
rownames(m2) <- c("a", "b", "c", "d")
m3 <- matrix(c(1, 2, 3, 4), 2,2)
rownames(m3) <- c("d", "e")
mlist <- list(m1, m2, m3)
For them I would like to get:
Row.names V1.x V2.x V1.y V2.y V1.z V2.z
a 1 4 1 4 NA NA
b 2 5 2 5 NA NA
c 3 6 3 6 NA NA
d NA NA 5 2 1 3
e NA NA NA NA 2 4
I have tried to use lapply with the function merge:
M <- lapply(mlist, merge, mlist, by = "row.names", all = TRUE)
However, it did not work:
Error in data.frame(c(1, 2, 3, 4, 5, 6), c(1, 2, 3, 5, 4, 5, 6, 2), c(1, :
arguments imply differing number of rows: 3, 4, 2
Is there an elegant way to merge these matrices?
You are trying to apply a reduction (?Reduce) to the list of matrices, where the reduction is basically merge. The problem is that merge(m1, m2, by = "row.names", all = T) doesn't give you a new merged matrix with row names, but instead returns the row names in the first column. This is why we need additional logic in the reduction function.
Reduce(function(a,b) {
    res <- merge(a, b, by = "row.names", all = TRUE)
    rn <- res[,1]           # Row.names column of merge
    res <- res[,-1]         # Actual data
    row.names(res) <- rn    # Assign row.names
    return(res)             # Return the merged data with proper row.names
  },
  mlist[-1],                # Reduce (left-to-right) by applying function(a,b) repeatedly
  init = mlist[[1]]         # Start with the first matrix
)
Or alternatively:
df <- mlist[[1]]
for (i in 2:length(mlist)) {
  df <- merge(df, mlist[[i]], by = "row.names", all=T)
  rownames(df) <- df$Row.names
  df <- df[ , !(names(df) %in% "Row.names")]
}
# V1.x V2.x V1.y V2.y V1 V2
# a 1 4 1 4 NA NA
# b 2 5 2 5 NA NA
# c 3 6 3 6 NA NA
# d NA NA 5 2 1 3
# e NA NA NA NA 2 4
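Note that in the output above the columns from the third matrix come back as plain V1 and V2, because merge only adds suffixes when names collide. One way to get predictable names is to label the columns before merging. The sketch below assumes that a ".m<k>" suffix (k being the position in mlist) and a temporary "key" column are acceptable; both names are made up for illustration.

```r
m1 <- matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
rownames(m1) <- c("a", "b", "c")
m2 <- matrix(c(1, 2, 3, 5, 4, 5, 6, 2), 4, 2)
rownames(m2) <- c("a", "b", "c", "d")
m3 <- matrix(c(1, 2, 3, 4), 2, 2)
rownames(m3) <- c("d", "e")
mlist <- list(m1, m2, m3)

# Give every matrix unique column names and an explicit key column.
named <- Map(function(m, k) {
  d <- as.data.frame(m)
  names(d) <- paste0("V", seq_len(ncol(d)), ".m", k)
  d$key <- rownames(m)
  d
}, mlist, seq_along(mlist))

# Full outer join on the key, then move the key back into the row names.
merged <- Reduce(function(a, b) merge(a, b, by = "key", all = TRUE), named)
rownames(merged) <- merged$key
merged$key <- NULL
merged
```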
This could also be conceptualised as a reshape operation if the right long-form data.frame is passed to the function:
tmp <- do.call(rbind, mlist)
tmp <- data.frame(tmp, id=rownames(tmp),
time=rep(seq_along(mlist),sapply(mlist,nrow)) )
reshape(tmp, direction="wide")
# id X1.1 X2.1 X1.2 X2.2 X1.3 X2.3
#a a 1 4 1 4 NA NA
#b b 2 5 2 5 NA NA
#c c 3 6 3 6 NA NA
#d d NA NA 5 2 1 3
#e e NA NA NA NA 2 4