R: fill a new column in a data frame with a value by matching variables in reverse - r

I apologize for the title of this question. I can't figure out a good way to briefly describe what I want to do.
I have something like this, with >8000 rows:
x y value_xy
A B 7
A C 2
B A 3
B C 6
C A 2
C B 1
I want to create a new column, value_yx, that looks like this:
x y value_xy value_yx
A B 7 3
A C 2 2
B A 3 7
B C 6 1
C A 2 2
C B 1 6
For each value of x and y, I want to have a new column that finds the value of y to x (as y appears later in the x column). Sometimes these values are equal, other times they aren't.
I have explored using for loops, ave(), and several other functions, but I haven't been able to make it work.

Try merge. The by.x and by.y arguments specify columns to be matched, and here the order of matching columns is reversed in by.y:
merge(x = df, y = df, by.x = c("x", "y"), by.y = c("y", "x"))
# x y value_xy.x value_xy.y
# 1 A B 7 3
# 2 A C 2 2
# 3 B A 3 7
# 4 B C 6 1
# 5 C A 2 2
# 6 C B 1 6
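For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and renames the merged columns. The all.x = TRUE argument is my own addition (an assumption about what you would want): it keeps rows whose reversed pair is missing and fills value_yx with NA instead of dropping them.

```r
# Sample data from the question
df <- data.frame(
  x = c("A", "A", "B", "B", "C", "C"),
  y = c("B", "C", "A", "C", "A", "B"),
  value_xy = c(7, 2, 3, 6, 2, 1)
)

# Self-merge: match (x, y) on the left against (y, x) on the right.
# all.x = TRUE keeps left rows that have no reversed partner (value becomes NA).
res <- merge(df, df, by.x = c("x", "y"), by.y = c("y", "x"), all.x = TRUE)

# merge() names the two value columns value_xy.x and value_xy.y; rename them
names(res) <- c("x", "y", "value_xy", "value_yx")
res
```

Note that merge sorts the result by the key columns, so the row order can differ from the input if the input was not already sorted.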

Looks like I was beaten to it, but here is an alternative solution with mapply:
df$value_yx = mapply(function(x_flip, y_flip) df[df$x == y_flip & df$y == x_flip,]$value_xy, df$x, df$y)
# x y value_xy value_yx
#1 A B 7 3
#2 A C 2 2
#3 B A 3 7
#4 B C 6 1
#5 C A 2 2
#6 C B 1 6
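One caveat with the mapply version: it scans the whole data frame once per row, and if some reversed pair is missing the result becomes a list rather than a vector. A possible alternative, sketched here with match on pasted keys, does a single lookup and gives NA for missing pairs. The "|" separator is an assumption; pick one that never occurs in your x or y values.

```r
# Sample data from the question
df <- data.frame(
  x = c("A", "A", "B", "B", "C", "C"),
  y = c("B", "C", "A", "C", "A", "B"),
  value_xy = c(7, 2, 3, 6, 2, 1)
)

# Build one key per row and one reversed key per row
key     <- paste(df$x, df$y, sep = "|")
rev_key <- paste(df$y, df$x, sep = "|")

# For each row, find the row whose key equals this row's reversed key
df$value_yx <- df$value_xy[match(rev_key, key)]
df
```

This keeps the original row order, and with >8000 rows it should be noticeably faster than the row-by-row subsetting.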

xtabs returns a matrix of values that can be indexed by a two-column character matrix. Here the index matrix is built from the first two columns in reversed order; they are probably factors, hence the need for the as.character() conversion:
> dfrm$value_yx <- xtabs(value_xy~x+y, dfrm)[
sapply(dfrm[2:1],as.character) ]
> dfrm
x y value_xy value_yx
1 A B 7 3
2 A C 2 2
3 B A 3 7
4 B C 6 1
5 C A 2 2
6 C B 1 6
--- See what is being indexed
> xtabs(value_xy~x+y, dfrm)
y
x A B C
A 0 7 2
B 3 0 6
C 2 1 0

Related

Simplest way to replace a list of values in a data frame with a list of new values

Say we have a data frame with a factor (Group) that is a grouping variable for a list of IDs:
set.seed(123)
data <- data.frame(Group = factor(sample(5,10, replace = T)),
ID = c(1:10))
In this example, the ID's belong to one of 5 Groups, labeled 1:5. We simply want to replace 1:5 with A:E. In other words, if Group == 1, we want to change it to A, if Group == 2, we want to change it to B, and so on. What is the simplest way to achieve this?
You can assign new labels= as a named list by calling factor once again.
data$Group1 <- factor(data$Group, labels=list("1"="A", "2"="B", "3"="C", "4"="D", "5"="E"))
## more succinct:
data$Group2 <- factor(data$Group, labels=setNames(list("A", "B", "C", "D", "E"), 1:5))
data
# Group ID Group1 Group2
# 1 3 1 C C
# 2 3 2 C C
# 3 2 3 B B
# 4 2 4 B B
# 5 3 5 C C
# 6 5 6 E E
# 7 4 7 D D
# 8 1 8 A A
# 9 2 9 B B
# 10 3 10 C C
This works in general; if capital letters are indeed what you want, see @RonakShah's solution.
You can use the built-in constant in R LETTERS :
data$new_group <- LETTERS[data$Group]
data
# Group ID new_group
#1 3 1 C
#2 3 2 C
#3 2 3 B
#4 2 4 B
#5 3 5 C
#6 5 6 E
#7 4 7 D
#8 1 8 A
#9 2 9 B
#10 3 10 C
Created a new column (new_group) here for comparison purposes. You can overwrite the same column if you wish to.
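Be aware that LETTERS[data$Group] indexes by the factor's internal integer codes, which only line up with the labels when the levels are exactly 1 to 5 with none missing. If that is not guaranteed, a named lookup vector indexed by the level labels may be safer. This is only a sketch; lookup is a hypothetical name.

```r
set.seed(123)
data <- data.frame(Group = factor(sample(5, 10, replace = TRUE)),
                   ID = 1:10)

# Map old label -> new label; the names are the existing factor labels
lookup <- c("1" = "A", "2" = "B", "3" = "C", "4" = "D", "5" = "E")

# Index by the label (as character), not by the factor's integer code
data$new_group <- unname(lookup[as.character(data$Group)])
data
```

The same pattern works for arbitrary replacements, not just letters, because the mapping is written out explicitly.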

Generate pairwise data.frame of all combinations of two data.frame with different number of rows

I have two data frames, a and b, that I want to combine into a final data frame c
a <- data.frame(city=c("a","b","c"),detail=c(1,2,3))
b <- data.frame(city=c("x","y"),detail=c(5,6))
the dataframe c should look like
city.a detail.a city.b detail.b
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
I think I could use crossing from tidyr but for crossing(a,b) I get:
error: Column names `city`, `detail` must not be duplicated.
Use .name_repair to specify repair.
Yes, crossing is the right function, but as the error message says, column names must not be duplicated. Rename the columns first:
names(a) <- paste0(names(a), ".a")
names(b) <- paste0(names(b), ".b")
tidyr::crossing(a, b)
# city.a detail.a city.b detail.b
# <fct> <dbl> <fct> <dbl>
#1 a 1 x 5
#2 a 1 y 6
#3 b 2 x 5
#4 b 2 y 6
#5 c 3 x 5
#6 c 3 y 6
crossing is a wrapper over expand_grid so after correcting the names you can also use it directly.
tidyr::expand_grid(a, b)
Here is a base R solution by using rep() + cbind(), which gives duplicated column names:
C <- `row.names<-`(cbind(a[rep(seq(nrow(a)),each = nrow(b)),],b),NULL)
such that
> C
city detail city detail
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
Or get a data frame having different column names by using data.frame():
C <- data.frame(a[rep(seq(nrow(a)),each = nrow(b)),],b,row.names = NULL)
such that
> C
city detail city.1 detail.1
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
With base R, we can use merge
merge(setNames(a, paste0(names(a), ".a")), b)
# city.a detail.a city detail
#1 a 1 x 5
#2 b 2 x 5
#3 c 3 x 5
#4 a 1 y 6
#5 b 2 y 6
#6 c 3 y 6
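A variation on the merge approach, as a rough sketch: passing by = NULL asks merge for the Cartesian product explicitly, and suffixes takes care of the clashing column names so no renaming is needed beforehand. The final ordering step is only there to match the row order shown in the question.

```r
a <- data.frame(city = c("a", "b", "c"), detail = c(1, 2, 3))
b <- data.frame(city = c("x", "y"), detail = c(5, 6))

# by = NULL: no key columns, so every row of a is paired with every row of b.
# Columns with the same name in both inputs get the given suffixes.
res <- merge(a, b, by = NULL, suffixes = c(".a", ".b"))

# Reorder so that rows are grouped by city.a, as in the desired output
res <- res[order(res$city.a, res$city.b), ]
rownames(res) <- NULL
res
```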

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which is a cumulative product of the columns a b c, however I need a reverse cumulative product i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
the column names do not matter this is just an example. Does anyone have any idea how to do this?
as per my comment, is it possible to do this on a subset of columns? e.g. only for columns b and c to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows and apply cumprod over the reverse of elements and then do the reverse
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R option uses Reduce with accumulate = TRUE, applied to the columns in reverse order:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
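Regarding the follow-up about using only a subset of columns: one way that should work is to select those columns before applying the reversed cumprod. This is a minimal sketch; cols and the result_ column names are just illustrative choices.

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns to include in the reverse cumulative product; "a" is left out
cols <- c("b", "c")

# Per row: reverse, take the cumulative product, reverse back.
# apply() returns one column per input row, hence the t().
res <- t(apply(as.matrix(x[cols]), 1, function(r) rev(cumprod(rev(r)))))

x[paste0("result_", cols)] <- res
x
```

This assumes at least two columns are selected; with a single column, apply returns a plain vector and the t() step would need adjusting.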

Subsetting data frame by another data frame

The data is as follows:
> x
a b
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
> y
a b
1 2 a
2 3 a
3 3 b
My goal is to compare both data frames and, for each row in x, indicate whether an equivalent row exists in y. All of the y rows are actually contained in x, so I would like to end up with something like this:
> x
a b intersect.x.y
1 1 a F
2 2 a T
3 3 a T
4 1 b F
5 2 b F
6 3 b T
How about that?
How about this?
x$rn <- 1:nrow(x)
xyrows <- merge(x,y)$rn # maybe you just want to look at the merge ...?
x$iny <- FALSE
x$iny[xyrows] <- TRUE
I suspect there is a more standard approach, but this way is easy to understand.
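A possibly more direct route, sketched below, is to paste the columns of each row into a single key and test membership with %in%. The "|" separator is an assumption and should be a character that never appears in the data.

```r
x <- data.frame(a = c(1, 2, 3, 1, 2, 3),
                b = c("a", "a", "a", "b", "b", "b"))
y <- data.frame(a = c(2, 3, 3),
                b = c("a", "a", "b"))

# One string key per row, built from all columns being compared
key_x <- paste(x$a, x$b, sep = "|")
key_y <- paste(y$a, y$b, sep = "|")

# TRUE where the row of x also occurs in y
x$iny <- key_x %in% key_y
x
```

This avoids the helper row-number column and leaves the row order of x untouched.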

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you that would also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal<-function(data,factors,responses,responsename) {
library(reshape2)
data$ID=1:nrow(data) # add an ID to preserve row order
dL=melt(data, id.vars=c("ID", factors)) # `melt` the data
dL=dL[order(dL$ID), ] # sort the molten data
dL[,responsename]=match(dL$variable,responses) # convert responses to ordinal scores
dL[,responsename]=factor(dL[,responsename],ordered=T)
dL=dL[dL$value != 0, ] # drop rows where `value == 0`
out=dL[rep(rownames(dL), dL$value), c(factors, responsename)] # use `rep` to "expand" `data.frame` & drop unwanted columns
rownames(out) <- NULL
return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
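If you would rather avoid the reshape2 dependency, the same expansion can probably be done with rep alone. The sketch below types in the six rows shown in the question instead of reading the CSV (so it runs offline) and stores aggression as an ordered factor, which is what an ordinal model would normally expect; count_cols, n, row_id and score are hypothetical helper names.

```r
# The six rows shown in the question, entered by hand
data <- data.frame(
  factor1 = c("a", "a", "b", "b", "c", "c"),
  factor2 = c("a", "b", "a", "b", "a", "b"),
  count_1 = c(1, 3, 1, 2, 3, 1),
  count_2 = c(2, 0, 2, 2, 4, 1),
  count_3 = c(0, 0, 3, 0, 0, 0)
)
count_cols <- c("count_1", "count_2", "count_3")

# Flatten the counts row by row: row 1's three counts, then row 2's, ...
n      <- as.vector(t(as.matrix(data[count_cols])))
row_id <- rep(seq_len(nrow(data)), each = length(count_cols))
score  <- rep(seq_along(count_cols), times = nrow(data))

# Repeat each (row, score) combination as many times as its count
out <- data[rep(row_id, n), c("factor1", "factor2")]
out$aggression <- factor(rep(score, n),
                         levels = seq_along(count_cols), ordered = TRUE)
rownames(out) <- NULL
out
```

Combinations with a count of zero are dropped automatically, because rep repeats them zero times.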
