I have a data frame as below
df<- data.frame(a = c(1,2,3,4,5),
b = c(6,7,8,9,10),
c = c(11,12,13,14,15),
z = c("b","c","a","a","b"))
In each row, I'm trying to replace with NA the value in the column whose name equals that row's value in column z. The desired output is below
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
I was thinking of something like the following logic, applied to each row:
If column name is equal to z, replace value with NA
But I can't figure it out. Any help appreciated.
Cheers!
Use matrix indexing, matching the z column against the column names:
df[cbind(seq(nrow(df)),match(df$z,colnames(df[1:3])))] <- NA
df
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
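To see why this works, it can help to build the two-column index matrix on its own. This is only a sketch using the df from the question; idx is an illustrative name, and matching against names(df) instead of the first three columns is my own variation, so that the positions refer to the whole data frame.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(6, 7, 8, 9, 10),
                 c = c(11, 12, 13, 14, 15),
                 z = c("b", "c", "a", "a", "b"))

# One (row, column) pair per row: the row number, and the position of the
# column whose name is stored in z for that row
idx <- cbind(seq_len(nrow(df)), match(df$z, names(df)))
idx[, 2]
# column positions: 2 3 1 1 2

# A two-column numeric matrix used as an index addresses individual cells
df[idx] <- NA
```

Each row of idx picks exactly one cell, so only the five cells named by z are set to NA.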
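This only works if the columns named by the letters are in lexicographic order at the start of the data frame, because it uses the factor codes of z as column positions. z has to be a factor for that; since R 4.0.0, data.frame() keeps strings as character by default, so the conversion is made explicit here:
> df[cbind(1:5, as.integer(factor(df$z)))] <- NA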
> df
a b c z
1 1 NA 11 b
2 2 7 NA c
3 NA 8 13 a
4 NA 9 14 a
5 5 NA 15 b
Related
I have 2 data frames with different rownames, e.g.:
df1 <- data.frame(A = c(1,3,7,1,5), B = c(5,2,9,5,5), C = c(1,1,3,4,5))
df2 <- data.frame(A = c(4,3,2), B = c(4,4,9), C = c(3,9,3))
rownames(df2) <- c(1, 3, 6)
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
> df2
A B C
1 4 4 3
3 3 4 9
6 2 9 3
I need to insert NA rows in both data frames for each row that exists in only one of the data frames. In the given example:
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
6 NA NA NA
> df2
A B C
1 4 4 3
2 NA NA NA
3 3 4 9
4 NA NA NA
5 NA NA NA
6 2 9 3
I will have to perform this operation many times with different data frames, so I need an automated way to do this. I was trying to solve the issue with different if/else loops, but I am sure there must be a more automated way.
We can use union to collect the row names from both data frames, create an all-NA data frame with those row names, and then use %in% to overwrite the rows whose names exist in each original data frame.
un1 <- union(rownames(df1), rownames(df2))
d1 <- as.data.frame(matrix(NA, ncol = ncol(df1),
nrow = length(un1), dimnames = list(un1, names(df1))))
d2 <- d1
d1[rownames(d1) %in% rownames(df1),] <- df1
d2[rownames(d2) %in% rownames(df2),] <- df2
d2
# A B C
#1 4 4 3
#2 NA NA NA
#3 3 4 9
#4 NA NA NA
#5 NA NA NA
#6 2 9 3
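Since this has to be repeated for many data frames, the same idea could be wrapped in a small helper. This is a sketch only: pad_rows is a hypothetical name, not an existing function, and it uses match() on the row names (my own choice) so that rows are looked up by exact name rather than by position.

```r
# Hypothetical helper: return `d` with one row per name in `all_rows`;
# names that do not occur in rownames(d) become all-NA rows
pad_rows <- function(d, all_rows) {
  out <- d[match(all_rows, rownames(d)), , drop = FALSE]
  rownames(out) <- all_rows
  out
}

df1 <- data.frame(A = c(1, 3, 7, 1, 5), B = c(5, 2, 9, 5, 5), C = c(1, 1, 3, 4, 5))
df2 <- data.frame(A = c(4, 3, 2), B = c(4, 4, 9), C = c(3, 9, 3))
rownames(df2) <- c(1, 3, 6)

all_rows <- union(rownames(df1), rownames(df2))
d1 <- pad_rows(df1, all_rows)
d2 <- pad_rows(df2, all_rows)
```

Note that union keeps the order of first appearance, so the rows are not sorted unless all_rows is sorted first.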
df <- data.frame(x=c(1,2,1,2,3,3), y = c(letters[1:5],'a'), val = c(1:5, 9))
print(df)
x y val
1 a 1
2 b 2
1 c 3
2 d 4
3 e 5
3 a 9
I want to create a function fun(df, rowname, colname, valname) that takes a data frame and the names of the row, column, and value variables, and returns a data.frame or matrix with row names, column names, and values as shown below
fun(df, "x","y","val") should return
1 2 3
a 1 NA 9
b NA 2 NA
c 3 NA NA
d NA 4 NA
e NA NA 5
The reshape2 library allows this kind of manipulation:
library(reshape2)
dcast(data=df, y~x, value.var = "val")
y 1 2 3
1 a 1 NA 9
2 b NA 2 NA
3 c 3 NA NA
4 d NA 4 NA
5 e NA NA 5
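If a matrix with proper row and column names is wanted without an extra package, one possible base R sketch is below. The name to_wide and its argument names are hypothetical (they are not from the question or from reshape2), and the approach, filling an NA matrix through a two-column index, is my own assumption about what would fit here.

```r
# Hypothetical helper: spread `valvar` into a matrix with one row per
# distinct value of `rowvar` and one column per distinct value of `colvar`
to_wide <- function(df, rowvar, colvar, valvar) {
  rows <- sort(unique(df[[rowvar]]))
  cols <- sort(unique(df[[colvar]]))
  out <- matrix(NA, nrow = length(rows), ncol = length(cols),
                dimnames = list(as.character(rows), as.character(cols)))
  # (row position, column position) of every observation
  idx <- cbind(match(df[[rowvar]], rows), match(df[[colvar]], cols))
  out[idx] <- df[[valvar]]
  out
}

df <- data.frame(x = c(1, 2, 1, 2, 3, 3), y = c(letters[1:5], "a"), val = c(1:5, 9))
m <- to_wide(df, rowvar = "y", colvar = "x", valvar = "val")
```

If the same row/column pair occurs more than once, the last value wins, so this only suits data with at most one value per cell.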
Suppose I have a list like this:
df1<-data.frame(n=letters[1:4], x=1:4, y=2:5, z=3:6)
df2<-data.frame(n=letters[2:5], x=2:5, y=3:6, z=4:7)
df3<-data.frame(n=letters[3:7], x=2:6, y=3:7, z=4:8)
ls<-list(df1, df2, df3)
ls
[[1]]
n x y z
1 a 1 2 3
2 b 2 3 4
3 c 3 4 5
4 d 4 5 6
[[2]]
n x y z
1 b 2 3 4
2 c 3 4 5
3 d 4 5 6
4 e 5 6 7
[[3]]
n x y z
1 c 2 3 4
2 d 3 4 5
3 e 4 5 6
4 f 5 6 7
5 g 6 7 8
What I want is to merge the first two columns (n and x) of each data frame in the list by column n. The desired output would be:
n x1 x2 x3
1 a 1 NA NA
2 b 2 2 NA
3 c 3 3 2
4 d 4 4 3
5 e NA 5 4
6 f NA NA 5
7 g NA NA 6
And same thing for y and z:
n y1 y2 y3
1 a 2 NA NA
2 b 3 3 NA
3 c 4 4 3
4 d 5 5 4
5 e NA 6 5
6 f NA NA 6
7 g NA NA 7
n z1 z2 z3
1 a 3 NA NA
2 b 4 4 NA
3 c 5 5 4
4 d 6 6 5
5 e NA 7 6
6 f NA NA 7
7 g NA NA 8
We get the unique column names from the list of data frames, excluding 'n' (nm1), and loop through them with lapply(nm1, ...). For each name, we subset the 'n' column and that column from every data frame in 'ls' with lapply(ls, function(x) ...), then use Reduce with merge to merge the subsets into one data frame.
nm1 <- setdiff(unlist(lapply(ls, names)), "n")
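lapply(nm1, function(nm) {
  # full outer join of the 'n' + nm columns from every data frame
  merged <- Reduce(function(...) merge(..., all = TRUE, by = "n"),
                   lapply(ls, function(x) x[c("n", nm)]))
  # one value column per data frame, so repeat nm length(ls) times
  setNames(merged, make.unique(c("n", rep(nm, length(ls)))))
})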
#[[1]]
# n x x.1 x.2
#1 a 1 NA NA
#2 b 2 2 NA
#3 c 3 3 2
#4 d 4 4 3
#5 e NA 5 4
#6 f NA NA 5
#7 g NA NA 6
#[[2]]
# n y y.1 y.2
#1 a 2 NA NA
#2 b 3 3 NA
#3 c 4 4 3
#4 d 5 5 4
#5 e NA 6 5
#6 f NA NA 6
#7 g NA NA 7
#[[3]]
# n z z.1 z.2
#1 a 3 NA NA
#2 b 4 4 NA
#3 c 5 5 4
#4 d 6 6 5
#5 e NA 7 6
#6 f NA NA 7
#7 g NA NA 8
NOTE: ls is also the name of a base R function that lists objects. It is better to avoid naming objects after existing R functions.
Here is another base R method that uses do.call, data.frame, and cbind within a nested pair of lapply functions.
# get all levels of n across data frames
allN <- unique(unlist(lapply(ls, "[[", "n")))
# extract desired columns and provide names with setNames
lapply(names(ls[[1]])[-1], function(var) {
  cbind("n"=allN, setNames(do.call(data.frame,
    lapply(seq_along(ls), function(i) {
      ls[[i]][[var]][match(allN, ls[[i]]$n, nomatch=NA)]
    })), paste0(var, seq_along(ls))))
})
The first lapply runs through each of the variable names; the second lapply extracts the current variable from each data frame in the list. In the middle, do.call turns the resulting list into a data.frame, setNames provides the desired names, and the n column is added with cbind.
In the innermost portion of the inner lapply, the code ls[[i]][[var]][match(allN, ls[[i]]$n, nomatch=NA)] is used to expand (and potentially reorder) the current vector according to the levels in allN. If the current vector is missing a level, the nomatch=NA tells match to instead return NA.
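The expansion step can be tried in isolation. The sketch below uses made-up stand-ins (all_n, n, x are illustrative names) for allN and for the n and x columns of the second data frame:

```r
all_n <- c("a", "b", "c", "d", "e")   # every level seen across the list
n <- c("b", "c", "d", "e")            # levels present in one data frame
x <- 2:5                              # the values that go with them

# match() gives, for each level in all_n, its position in n (NA if absent);
# indexing x with those positions expands it to the full set of levels
expanded <- x[match(all_n, n)]
expanded
# NA  2  3  4  5
```

Indexing with NA returns NA, which is how the missing level "a" ends up as NA in the result.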
My data frame, D, looks like this.
D$fit holds one value for each listed pair of distance (0:6) and dg (1:3)
D <- read.table(header = TRUE, text = "
distance dg fit
1 0 1 A
2 1 1 B
3 2 1 C
4 3 1 D
5 4 1 E
6 5 1 F
7 6 1 G
8 0 2 H
9 1 2 I
10 2 2 J
11 3 2 K
12 4 2 L
13 5 2 M
14 0 3 O
15 1 3 P
16 2 3 Q
17 3 3 R
")
I want to assign fit values to this matrix, md, corresponding to distance and dg.
md <- matrix(1:21, nrow = 7)
colnames(md) <- c(1:3)
rownames(md) <- c(0:6)
md[] <- NA
1 2 3
0 NA NA NA
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
I've tried but failed with this code
cmd = expand.grid(i=seq(0,6), j = seq(1,3))
i <- seq(0,6)
j <- seq(1,3)
md[i,j] <- D$fit[D$distance == cmd[1] & D$dg == cmd[2]]
We can use acast from library(reshape2)
library(reshape2)
acast(D, distance~dg, value.var="fit")
Or with reshape from base R
reshape(D, idvar="distance", timevar="dg", direction="wide")
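If the goal is to fill an existing matrix such as md in place, another option (my own suggestion, not one of the answers above) is to index the matrix by its row and column names with a two-column character matrix. A sketch, rebuilding D compactly with data.frame instead of read.table:

```r
# Same values as the D in the question (note that "N" is skipped)
D <- data.frame(distance = c(0:6, 0:5, 0:3),
                dg = rep(1:3, times = c(7, 6, 4)),
                fit = c(LETTERS[1:13], LETTERS[15:18]))

md <- matrix(NA_character_, nrow = 7, ncol = 3,
             dimnames = list(as.character(0:6), as.character(1:3)))

# Each row of the character matrix is a (row name, column name) pair
md[cbind(as.character(D$distance), as.character(D$dg))] <- as.character(D$fit)
```

Cells with no matching distance/dg pair in D simply stay NA.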
I am new to R, so I am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need:
To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first row, the maximum comes from column A (15) and the priority value comes from column D (9), since E is NA. The answer I want should look like this:
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = TRUE)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df[,pl],1,max,na.rm=TRUE)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
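Note that for any row that is all NA, max(..., na.rm = TRUE) returns -Inf with a warning and the priority value is NA, so the difference comes out as NA. There is no explicit check against that.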
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max and its difference from the first remaining (highest-priority) value.
df <- dat[, pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
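For completeness, here is a sketch that puts the pieces together on the data from the question and returns NA (rather than -Inf) for an all-NA row. The data frame is retyped from the table in the question; the names vals, row_max and first_pr are illustrative, and the explicit all-NA check is my own addition.

```r
dat <- data.frame(A = c(15, 3, NA, 0, 1),
                  B = c(5, 2, NA, 1, 2),
                  C = c(20, NA, 3, 0, 3),
                  D = c(9, 5, NA, 7, NA),
                  E = c(NA, 1, NA, 8, NA),
                  F = c(6, 3, NA, NA, 1),
                  G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

# Only the priority columns, in priority order
vals <- as.matrix(dat[pl])

# Row maximum, with an explicit NA for rows that have no values at all
row_max <- apply(vals, 1, function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
})

# First non-NA value in priority order (NA if the whole row is NA)
first_pr <- apply(vals, 1, function(x) x[!is.na(x)][1])

dat$MAX <- row_max
dat$MAX_PR <- abs(row_max - first_pr)
```

Because the check happens before max is called, no warning is raised for the third row.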