How to manipulate multiple columns in R

I have the following table:
X Y
A 4 8
B 2 6
C 5 4
D 6 3
E 9 13
But I would like to re-arrange this to look like:
AX AY BX BY CX CY......
4 8 2 6 5 4
I am working in R and get the table by doing
table(db[,1],db[,2])
How can I change the command to get the desired output?

If you do not care about the names and your data are numeric, the easiest solution is to transpose the table (called x here) and flatten it to a vector, like so:
as.vector( t( x ) )
# [1] 4 8 2 6 5 4 6 3 9 13
If you also want to preserve the names, use expand.grid to get the combinations...
# The data
y <- as.vector( t( x ) )
# Combinations of row and column names
nms <- expand.grid( colnames(x) , rownames(x) )
# Rename vector with desired names
names(y) <- paste0( nms[,2] , nms[,1] )
#AX AY BX BY CX CY DX DY EX EY
# 4 8 2 6 5 4 6 3 9 13

Assuming your data is set up this way:
db <- data.frame(
c(rep("A", 4), rep("B", 2), rep("C", 5), rep("D", 6), rep("E", 9),
rep("A", 8), rep("B", 6), rep("C", 4), rep("D", 3), rep("E", 13)),
c(rep("X", 26), rep("Y", 34)),
stringsAsFactors = FALSE)
tab <- table(db[,1],db[,2])
You could do this in a one-liner (note that the result comes out in column-major order, with all the X counts first):
array(tab, dimnames = list(do.call("paste0", expand.grid(dimnames(tab)))))
AX BX CX DX EX AY BY CY DY EY
4 2 5 6 9 8 6 4 3 13
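The question asks for the order AX AY BX BY ..., so here is a minimal sketch of one way to get it from the same table: transpose first, then build the names with outer(). The column names grp and var and the object names tt and res are only placeholders for this illustration.

```r
# Same data as above, with hypothetical column names grp and var
db <- data.frame(
  grp = c(rep("A", 4), rep("B", 2), rep("C", 5), rep("D", 6), rep("E", 9),
          rep("A", 8), rep("B", 6), rep("C", 4), rep("D", 3), rep("E", 13)),
  var = c(rep("X", 26), rep("Y", 34)),
  stringsAsFactors = FALSE)
tab <- table(db$grp, db$var)

# Transpose so that flattening (column-major) walks A-X, A-Y, B-X, B-Y, ...
tt <- t(tab)

# outer() builds every name in the same layout as tt: group letter first, then X/Y
nms <- outer(rownames(tt), colnames(tt), function(v, g) paste0(g, v))

res <- setNames(as.vector(tt), as.vector(nms))
res
```

Because the values and the names are flattened from matrices with the same shape, they stay aligned without having to reason about the order expand.grid produces.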


Loop to execute in different dataframes in r [duplicate]

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my oversimplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
What I want is a 4th column with the average of start and stop for both dfs:
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
If you want all the results combined into a single data frame, this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
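z <- list(df1, df2)
df <- NULL
for (i in z) {
  i$Avg <- (i$x + i$y) / 2  # average of the two numeric columns
  df <- rbind(df, i)        # append this data frame's rows to the result
}
print(df)                   # print once, after all data frames are combined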
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
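Growing a data frame with rbind inside a loop works, but the same result can be had without the loop. This is only a sketch using the example data above; the source column and the names pieces and combined are made up for the illustration and are not part of the original answer.

```r
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])

# A named list, so each row can remember which data frame it came from
z <- list(df1 = df1, df2 = df2)

pieces <- lapply(names(z), function(nm) {
  d <- z[[nm]]
  d$Avg <- rowMeans(d[c("x", "y")])  # average of the two numeric columns
  d$source <- nm                     # hypothetical label column
  d
})

# Bind all pieces in one call instead of once per iteration
combined <- do.call(rbind, pieces)
combined
```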
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets:
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:(n+2)])
The first thing to do is to make a list of the dfs:
df.list <- lapply(1:n, function(x) eval(parse(text = paste0("df", x)))) # Store all datasets in one list, looking each one up by name
names(df.list) <- paste0("df", 1:n) # Name each element after its df, so the list can be unpacked afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in seq_along(df.list)) {
  df.list[[i]][["Avg"]] <- rowMeans(df.list[[i]][1:2])  # mean of the first two columns (start, stop)
}
And you have (in case your list only includes the first two datasets):
> df.list
$df1
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
$df2
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.
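As a possible alternative to eval(parse(...)), base R's mget() fetches several objects by name and returns them as an already-named list. A rough sketch, assuming the data frames are called df1 and df2 as in the question (n is set to 2 here only for the example):

```r
df1 <- data.frame(start = seq(0, 20, 10), stop = seq(10, 30, 10), ID = letters[24:26])
df2 <- data.frame(start = seq(0, 20, 10), stop = seq(10, 30, 10), ID = letters[1:3])

n <- 2  # number of data frames in this example

# mget() looks up df1 ... dfn by name; the list elements are named df1 ... dfn
df.list <- mget(paste0("df", seq_len(n)), envir = environment())

# Add the Avg column to every data frame in the list
df.list <- lapply(df.list, function(d) {
  d$Avg <- rowMeans(d[c("start", "stop")], na.rm = TRUE)
  d
})

df.list$df1$Avg
# [1]  5 15 25
```

Since the list is named, list2env(df.list, .GlobalEnv) can still be used afterwards to write the modified data frames back.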

Add column inside named matrix

Suppose I have the following matrix:
m <- matrix(1:12, nrow = 3, dimnames = list(c("a", "b", "c"), c("w", "x", "y", "z")))
# w x y z
# a 1 4 7 10
# b 2 5 8 11
# c 3 6 9 12
How can I add a column with the values c(13, 14, 15) between column x and y without knowing where x and y are?
Using number ranges I know how to do this using cbind.
cbind(m[,1:2], c(13, 14, 15), m[,3:4])
# w x y z
# a 1 4 13 7 10
# b 2 5 14 8 11
# c 3 6 15 9 12
For named columns, it'd be neat if I could supply the column ranges with m[,:"x"] and m[,"y":] of some sort, but unfortunately that doesn't work.
Additionally, if possible, giving that column its own header name during the insertion process would be nice.
EDIT: I should have specified that x and y always are in order, so adding the column after x would have been enough. Thanks for the more general answers as well!
If you cannot assume that x comes before y, and the two columns need not be adjacent, you can try:
i <- seq_len(min(match(c("x", "y"), colnames(m))))
cbind(m[,i], v=c(13, 14, 15), m[,-i])
# w x v y z
#a 1 4 13 7 10
#b 2 5 14 8 11
#c 3 6 15 9 12
If they are always in order, it is enough to insert the column after x:
i <- seq_len(match("x", colnames(m)))
cbind(m[,i], v=c(13, 14, 15), m[,-i])
You can find the column positions by name and insert the new column accordingly:
x_pos <- which(colnames(m) == "x")
y_pos <- which(colnames(m) == "y")
m <- cbind(m[,1:x_pos], new=c(13, 14, 15), m[,y_pos:ncol(m)])
You can use which to find the desired column and assign a name in cbind, i.e.
cbind(m[, seq(which(colnames(m) == 'x'))],
w = c(13, 14, 15),
m[, (which(colnames(m) == 'y'):ncol(m))])
# w x w y z
#a 1 4 13 7 10
#b 2 5 14 8 11
#c 3 6 15 9 12
It's not exactly pretty but you can do this
cbind(m[,1:(which(dimnames(m)[[2]]=="x"))],
t=c(13, 14, 15),
m[,(which(dimnames(m)[[2]]=="y")):dim(m)[2]])
You can use this function:
insert_a_column <- function(mat, first_col, second_col, new_col, vec) {
  # Get index of first column to match
  one <- match(first_col, colnames(mat))
  # Get index of second column to match
  # (not used below: the new column always goes directly after first_col)
  two <- match(second_col, colnames(mat))
  # Add the middle column and combine the data
  new_mat <- cbind(mat[, 1:one, drop = FALSE], vec,
                   mat[, (one + 1):ncol(mat), drop = FALSE])
  # Rename the new column
  colnames(new_mat)[one + 1] <- new_col
  # Return the matrix
  return(new_mat)
}
insert_a_column(m, "x", "y", "a", c(13, 14, 15))
# w x a y z
#a 1 4 13 7 10
#b 2 5 14 8 11
#c 3 6 15 9 12
insert_a_column(m, "y", "z", "a", c(13, 14, 15))
# w x y a z
#a 1 4 7 13 10
#b 2 5 8 14 11
#c 3 6 9 15 12
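Several of the answers above index with something like (pos + 1):ncol(m), which would go out of bounds if the target column happened to be the last one. Here is a small sketch that sidesteps this by appending the new column first and then reordering; insert_after is a hypothetical helper name, not an existing function.

```r
m <- matrix(1:12, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("w", "x", "y", "z")))

# Hypothetical helper: insert vec as a new column named `name` right after column `after`
insert_after <- function(mat, after, vec, name) {
  pos <- match(after, colnames(mat))
  stopifnot(!is.na(pos), length(vec) == nrow(mat))
  out <- cbind(mat, vec)              # new column goes to the end first
  colnames(out)[ncol(out)] <- name
  # append() yields the column order: 1..pos, new column, then the rest
  ord <- append(seq_len(ncol(mat)), ncol(out), after = pos)
  out[, ord, drop = FALSE]
}

res <- insert_after(m, "x", c(13, 14, 15), "v")
colnames(res)
# [1] "w" "x" "v" "y" "z"
```

The same call with "z" as the anchor simply leaves the new column at the end.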

Remove duplicates based on 2 columns of data [duplicate]

I have a set of data:
x <- c(rep("A", 3), rep("B", 3), rep("C",2))
y <- c(1,1,2,4,1,1,2,2)
z <- c(rep("E", 1), rep("F", 4), rep("G",3))
df <-data.frame(x,y,z)
I only want to remove a duplicate row if both column x and column z are duplicated.
In this case, after applying the code, rows 2 and 3 should be reduced to one row, rows 4 and 5 to one row, and rows 7 and 8 to one row.
How can I do it?
You can use a simple condition to subset your data:
x <- c(rep("A", 3), rep("B", 3), rep("C",2))
y <- c(1,1,2,4,1,1,2,2)
z <- c(rep("A", 1), rep("B", 4), rep("C",3))
df <-data.frame(x,y,z)
df
df[df$x != df$z, ] # excludes all rows for which x == z is TRUE
x y z
2 A 1 B
3 A 2 B
6 B 1 C
Edit: As @RonakShah commented, to exclude duplicated rows, use
df[!duplicated(df[c("x", "z")]),]
or
df[!duplicated(df[c(1, 3)]),]
x y z
1 A 1 A
2 A 1 B
4 B 4 B
6 B 1 C
7 C 2 C
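For reference, here is the same duplicated() approach applied to the data from the question itself (with z made of E, F, G). This is just a sketch; the names dedup and dedup_last are placeholders.

```r
x <- c(rep("A", 3), rep("B", 3), rep("C", 2))
y <- c(1, 1, 2, 4, 1, 1, 2, 2)
z <- c(rep("E", 1), rep("F", 4), rep("G", 3))
df <- data.frame(x, y, z)

# duplicated() flags a row when its (x, z) pair has already appeared above it
dedup <- df[!duplicated(df[c("x", "z")]), ]
rownames(dedup)
# [1] "1" "2" "4" "6" "7"

# fromLast = TRUE keeps the last occurrence of each (x, z) pair instead
dedup_last <- df[!duplicated(df[c("x", "z")], fromLast = TRUE), ]
rownames(dedup_last)
# [1] "1" "3" "5" "6" "8"
```

Note that y is ignored when deciding what counts as a duplicate, so which y value survives depends on whether you keep the first or the last occurrence.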
