Converting counts to individual observations in R

I have a data set that looks as follows
df <- data.frame(name = c("a", "b", "c"),
                 judgement1 = c(5, 0, NA),
                 judgement2 = c(1, 1, NA),
                 judgement3 = c(2, 1, NA))
I want to reshape the dataframe to look like this
# name judgement1 judgement2 judgement3
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# b 1 0 0
# b 0 1 0
# b 0 0 1
And so on. I have seen that untable is recommended in some other threads, but it does not appear to work with the current version of R. Is there a package that can convert summarised counts into individual observations?

You could try something like this:
df <- data.frame(name = c("a", "b", "c"),
                 judgement1 = c(5, 0, NA),
                 judgement2 = c(1, 1, NA),
                 judgement3 = c(2, 1, NA))
rep.vec <- colSums(df[colnames(df) %in% paste0("judgement", 1:nrow(df))], na.rm = TRUE)
want <- data.frame(name = df$name, cbind(diag(nrow(df))))
colnames(want)[-1] <- paste0("judgement", 1:nrow(df))
(want <- want[rep(1:nrow(want), rep.vec), ])

I wrote a function that works to give you your desired output:
untabl <- function(df, id.col, count.cols) {
  df[is.na(df)] <- 0                          # replace NAs
  out <- lapply(count.cols, function(x) {     # for each column with counts
    z <- df[rep(1:nrow(df), df[, x]), ]       # replicate rows
    z[, -c(id.col)] <- 0                      # set all other columns to zero
    z[, x] <- 1                               # replace the count values with 1
    z
  })
  out <- do.call(rbind, out)                  # combine the list
  out <- out[order(out[, c(id.col)]), ]       # reorder (you can change this)
  rownames(out) <- NULL                       # return to simple row numbers
  out
}
untabl(df = df, id.col = 1, count.cols = c(2,3,4))
# name judgement1 judgement2 judgement3
#1  a 1 0 0
#2  a 1 0 0
#3  a 1 0 0
#4  a 1 0 0
#5  a 1 0 0
#6  a 0 1 0
#7  a 0 0 1
#8  a 0 0 1
#9  b 0 1 0
#10 b 0 0 1
And for your reference, reshape::untable consists of the following code:
function (df, num)
{
df[rep(1:nrow(df), num), ]
}
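As a side note, if a tidyverse dependency is acceptable, the same expansion can be written with tidyr::uncount(), which replicates rows according to a weights column much like reshape::untable does. A minimal sketch (not from the original answers; assumes tidyr >= 1.0 and dplyr are available):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(starts_with("judgement"), names_to = "judgement", values_to = "n") %>%
  filter(!is.na(n) & n > 0) %>%
  uncount(n) %>%                             # one row per counted observation
  mutate(obs = row_number(), flag = 1) %>%   # obs keeps rows distinct when widening
  pivot_wider(names_from = judgement, values_from = flag, values_fill = 0) %>%
  select(-obs)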

Related

How to change values of R cells (dataframe) based on a condition for specific rows?

I have the following dataframe:
C1 C2 C3
 0  0  0
 1  1  0
 0  0  0
 1  1  0
 0  0  0
I now want to apply the following conditions to the dataframe, for specific indexes only:
C1 should be equal to 0
A random number should be less than 0.5
If both conditions hold, I want to change the value of the cells in C1 and C2 to 1, else do nothing.
I am trying the following (rowsIndex is the vector of indexes on which I want to apply the conditions):
apply(DF[rowsIndex,], 2, fun)
where fun is:
fun<- function(x) {
ifelse(x==0,ifelse(runif(n=1)<0.5,x <- 1,x),x )
print(x)
}
My questions are:
In my function, how do I apply the conditions to a certain column only, i.e. C1? (I have tried using DF[rowsIndex, c(1)], but it gives an error.)
Is there any other approach I can take, since this approach is not giving me any results and the same DF is printed?
Thanks
If you want to stay in base R:
#your dataframe
DF <- data.frame(C1 = c(0, 1, 0, 1, 0),
C2 = c(0, 1, 0, 1, 0),
C3 = c(0, 0, 0, 0, 0))
fun <- function(x) {
  if (x[1] == 0 & runif(n = 1) < 0.5) {
    x[1:2] <- 1
  }
  return(x)
}
#your selection of rows you want to process
rowsIndex <- c(1, 2, 3, 4)
#Using MARGIN = 1 applies the function to the rows of a dataframe
#this returns a dataframe containing your selected and processed rows
DF_processed <- t(apply(DF[rowsIndex,], 1, fun))
#replace the selected rows in the original DF by the processed rows
DF[rowsIndex, ] <- DF_processed
print(DF)
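If the row-wise apply() is not essential, the same update can also be done in a fully vectorized way. A small sketch (not from the original answer), assuming one independent runif() draw per selected row:
# rows of rowsIndex where C1 is 0 and the coin flip succeeds
idx <- rowsIndex[DF$C1[rowsIndex] == 0 & runif(length(rowsIndex)) < 0.5]
DF[idx, c("C1", "C2")] <- 1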
Something like this?
library(dplyr)
df %>%
mutate(across(c(C1, C2), ~ifelse(C1 == 0 & runif(1) < 0.5, 1, .)))
C1 C2 C3
1 1 0 0
2 1 1 0
3 1 0 0
4 1 1 0
5 1 0 0
Applying it to your function:
fun<- function(df, x, y) {
df %>%
mutate(across(c({{x}}, {{y}}), ~ifelse({{x}} == 0 & runif(1) < 0.5, 1, .)))
}
fun(df, C1, C2)
C1 C2 C3
1 0 0 0
2 1 1 0
3 0 0 0
4 1 1 0
5 0 0 0

Adjacency Matrix from a dataframe

I am trying to convert an edgelist to an adjacency matrix.
Below is the sample data
#Sample Data
User<-c("1","1","2","3","4")
v1 <- c("b", "b", "a", "d", "c")
v2 <- c("c", "d", "c", "a", "a")
v3 <- c(0, 0, "d", 0, "b")
v4 <- c(0, 0, 0, 0, 0)
v5 <- c(0, 0, 0, 0, 0)
my_data<-data.frame(User, v1, v2, v3, v4, v5)
my_data
If you run this code you will get the output below:
User v1 v2 v3 v4 v5
1 b c 0 0 0
1 b d 0 0 0
2 a c d 0 0
3 d a 0 0 0
4 c a b 0 0
Using the data, I want to create an adjacency matrix that looks as follows:
a b c d
a 0 0 2 2
b 0 0 1 1
c 2 1 0 1
d 2 1 1 0
Basically, the desired output displays how many times each pair appears across columns v1 to v5 in the sample data frame.
I have tried to use the AdjacencyFromEdgelist function from the dils library, and I also tried to create a matrix shell with NAs and fill it out by looping through the dataframe.
However, I could not get either way to work.
I think this may be close to what you have in mind. In the rows where there are more than 2 vertices, I considered every existing pair:
library(igraph)
do.call(rbind, my_data[-1] |>
apply(1, \(x) x[x != 0]) |>
lapply(\(x) t(combn(x, m = 2)))) |>
graph_from_edgelist(directed = FALSE) %>%
as_adjacency_matrix()
4 x 4 sparse Matrix of class "dgCMatrix"
b c d a
b . 2 1 1
c 2 . 1 2
d 1 1 . 2
a 1 2 2 .
Or without the pipe operator, in base R:
tmp <- apply(my_data[-1], 1, function(x) x[x != 0])
tmp <- do.call(rbind, lapply(tmp, function(x) t(combn(x, m = 2))))
my_graph <- graph_from_edgelist(tmp, directed = FALSE)
adj_mat <- as_adjacency_matrix(my_graph)
adj_mat
Another attempt, minus the need to calculate all the combinations with combn
sel <- my_data[-1] != 0
dat <- data.frame(row=row(my_data[-1])[sel], value = my_data[-1][sel])
out <- crossprod(table(dat))
diag(out) <- 0
out
# value
#value a b c d
# a 0 1 2 2
# b 1 0 2 1
# c 2 2 0 1
# d 2 1 1 0
Matches the result from #AnoushiravanR:
adj_mat[c("a","b","c","d"), c("a","b","c","d")]
#4 x 4 sparse Matrix of class "dgCMatrix"
# a b c d
#a . 1 2 2
#b 1 . 2 1
#c 2 2 . 1
#d 2 1 1 .
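For readers wondering why crossprod(table(dat)) yields pair counts: table(dat) is a row-by-vertex incidence table (1 where the vertex occurs in that row of my_data), and crossprod(A) computes t(A) %*% A, so entry (i, j) counts the rows in which vertices i and j occur together. A quick way to inspect the intermediate step, reusing the dat object from above:
inc <- table(dat)   # rows of my_data x vertex labels
inc
crossprod(inc)      # off-diagonal entries are the co-occurrence counts; the diagonal counts
                    # how many rows contain each vertex, which is why it is zeroed out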
Another igraph option (note that this also needs dplyr loaded for filter() and arrange()):
library(dplyr)
do.call(
  rbind,
  combn(my_data[-1], 2, setNames, nm = c("from", "to"), simplify = FALSE)
) %>%
  filter(from > 0 & to > 0) %>%
  arrange(from) %>%
  graph_from_data_frame(directed = FALSE) %>%
  get.adjacency(sparse = FALSE)
gives
a b c d
a 0 1 2 2
b 1 0 2 1
c 2 2 0 1
d 2 1 1 0

Replace column values based on column name

I have a data frame with several binary variables: x1, x2, ... x100. I want to replace the entry 1 in each column with the number in the name of the column, i.e.:
data$x2[data$x2 == 1] <- 2
data$x3[data$x3 == 1] <- 3
data$x4[data$x4 == 1] <- 4
data$x5[data$x5 == 1] <- 5
...
How can I achieve this in a loop?
Using col:
# example data
set.seed(1); d <- as.data.frame(matrix(sample(0:1, 12, replace = TRUE), nrow = 3))
names(d) <- paste0("x", seq(ncol(d)))
d
# x1 x2 x3 x4
# 1 0 0 0 1
# 2 1 1 0 0
# 3 0 0 1 0
ix <- d == 1
d[ ix ] <- col(d)[ ix ]
d
# x1 x2 x3 x4
# 1 0 0 0 4
# 2 1 2 0 0
# 3 0 0 3 0
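For reference, col(d) simply builds a matrix of column indices with the same shape as d, so indexing it with the logical matrix ix picks up the column number at every position that held a 1. A quick illustration (only the dimensions matter, not the values of d):
col(d)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    1    2    3    4
# [3,]    1    2    3    4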
dplyr approach (using zx8754's data):
library(dplyr)
d %>%
mutate(across(starts_with('x'), ~ . * as.numeric(gsub('x', '', cur_column()))))
#> x1 x2 x3 x4
#> 1 0 0 0 4
#> 2 1 2 0 0
#> 3 0 0 3 0
Created on 2021-05-26 by the reprex package (v2.0.0)
Here is a base R solution with a lapply loop.
data[-1] <- lapply(names(data)[-1], function(k){
n <- as.integer(sub("[^[:digit:]]*", "", k))
data[data[[k]] == 1, k] <- n
data[[k]]
})
data
Test data.
set.seed(2021)
data <- replicate(6, rbinom(10, 1, 0.5))
data <- as.data.frame(data)
names(data) <- paste0("x", 1:6)
A solution based on a simple for loop is below (otherwise similar to the accepted answer using lapply):
for (i in 2:100) {
k <- paste0('x', i)
data[data[[k]] == 1, k] <- i
}

Efficiently repeat data.table in a list, sequentially replacing columns with the same names from another data.table in a loop

I have two data.tables:
x <- data.table(a = c(1, 2, 3, 4, 1), b = c(2, 3, 4, 1, 2), c = c(3, 4, 1, 2, 3))
y <- data.table(a = c(1, 0, 0, 0, 1), b = c(0, 1, 0, 0, 0), c = c(0, 0, 0, 0, 1))
What I am trying to achieve is to create a list of copies of y, with length equal to the number of its columns, where in each successive list element the next column is replaced by the values of the same column in x. The desired result should look like this:
[[1]]
a b c
1: 1 0 0
2: 2 1 0
3: 3 0 0
4: 4 0 0
5: 1 0 1
[[2]]
a b c
1: 1 2 0
2: 0 3 0
3: 0 4 0
4: 0 1 0
5: 1 2 1
[[3]]
a b c
1: 1 0 3
2: 0 1 4
3: 0 0 1
4: 0 0 2
5: 1 0 3
What I tried:
z <- lapply(names(x), function(i) {
x[ , i, with = FALSE]
})
w <- rep(list(y), ncol(y))
myfun <- function(obj1, obj2) {
cbind(obj1, obj2)
}
u <- Map(myfun, obj1 = z, obj2 = w)
u <- lapply(u, function(i) {
setcolorder(i[ , unique(names(i)), with = FALSE], names(x))
})
It gives me the desired result, but it is very clumsy and requires too many steps, so it is probably inefficient with larger data.tables. I would like to do it more in the data.table way. I tried something that I assumed would work:
lapply(names(x), function(i) {
y[ , (i) := x[ , i, with = FALSE]]
})
However, it returns the first list component empty and copies all the values of x into the next list components.
Can someone help?
Here, we need a copy of 'y' while creating the list 'w', instead of
w <- rep(list(y), ncol(y))
It is also tempting to go for the rep expression below. However, it has an issue too: the elements of 'w' still point to the same location in memory
w <- rep(list(copy(y)), ncol(y))
Assignment by reference (:=) changes the column values in every iteration of the loop because the list elements reference the same object in memory. In the first case, after the assignment, 'y' changes too, along with the 'w' list elements; in the second case, only 'w' changes and 'y' is left alone because we copied it. To understand the behavior, do a set assignment in a for loop:
for (j in seq_along(x)) {
  print(w[[j]][[j]])
  set(w[[j]], i = NULL, j = j, x[[j]])
  print("----")
  print(w[[j]])
}
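A quicker way to confirm that the list elements share memory is to compare their addresses with data.table::address() (a small check, not part of the original answer):
sapply(rep(list(y), ncol(y)), address)                          # one address repeated: every element points at y itself
sapply(replicate(ncol(y), copy(y), simplify = FALSE), address)  # three distinct addresses: independent copies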
In order to avoid that, use replicate:
w <- replicate(ncol(y), copy(y), simplify = FALSE)
and then do the for loop again (after recreating the objects, as the values were replaced in the previous run):
for (j in seq_along(x)) {
  print(w[[j]][[j]])
  set(w[[j]], i = NULL, j = j, x[[j]])
  print("----")
  print(w[[j]])
}
Or a Map-based assignment:
Map(function(u, v) u[, (v) := x[[v]]][], w, names(x))
#[[1]]
# a b c
#1: 1 0 0
#2: 2 1 0
#3: 3 0 0
#4: 4 0 0
#5: 1 0 1
#[[2]]
# a b c
#1: 1 2 0
#2: 0 3 0
#3: 0 4 0
#4: 0 1 0
#5: 1 2 1
#[[3]]
# a b c
#1: 1 0 3
#2: 0 1 4
#3: 0 0 1
#4: 0 0 2
#5: 1 0 3
Instead of assignment by reference, it can be done with a simple Map from base R, even if we have not copied the 'y' object while creating 'w':
Map(function(u, v) {
  u[[v]] <- x[[v]]
  u
}, w, names(x))

How to build binary data.frame in R for multiple dimensions?

I have a dataframe with three factors of which two are binary and the third one is integer:
DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0
etc.
Now I want to transform it to something like this
YEAR1_REGION1 YEAR1_REGION2 YEAR2_REGION1 YEAR2_REGION2
OBS1 X 0 0 0
OBS2 0 Y 0 0
OBS3 0 0 Z 0
Basic matrix multiplication is not what I'm after. I would like to find a neat way to do this that would automatically have the columns renamed as well. My actual data has three factor dimensions with 20*8*6 observations so finally there will be 960 columns altogether.
Here's another approach, based on outer and similar to #Roland's answer.
year <- grep("YEAR", names(DF), value = TRUE)
region <- grep("REGION", names(DF), value = TRUE)
data <- as.character(DF$DATA)
df <- outer(year, region, function(x, y) DF[,x] * DF[,y])
colnames(df) <- outer(year, region, paste, sep = "_")
df <- as.data.frame(df)
for (i in seq_len(ncol(df)))
df[as.logical(df[,i]), i] <- data[as.logical(df[,i])]
df
## YEAR1_REGION1 YEAR2_REGION1 YEAR1_REGION2 YEAR2_REGION2
## OBS1 X 0 0 0
## OBS2 0 0 Y 0
## OBS3 0 Z 0 0
Maybe others will come up with a more succinct possibility, but this creates the expected result:
DF <- read.table(text=" DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0", header=TRUE)
DF[,-1] <- lapply(DF[,-1], as.logical)
DF[,1] <- as.character(DF[,1])
res <- apply(expand.grid(2:3, 4:5), 1, function(i) {
tmp <- rep("0", length(DF[,1]))
ind <- do.call(`&`,DF[,i])
tmp[ind] <- DF[ind,1]
tmp <- list(tmp)
names(tmp) <- paste0(names(DF)[i], collapse="_")
tmp
})
res <- as.data.frame(res)
rownames(res) <- rownames(DF)
# YEAR1_REGION1 YEAR2_REGION1 YEAR1_REGION2 YEAR2_REGION2
# OBS1 X 0 0 0
# OBS2 0 0 Y 0
# OBS3 0 Z 0 0
However, I suspect there is a much better possibility to achieve what you actually want to do, without creating a huge wide-format data.frame.
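For completeness, one more possibility (a rough sketch, not from the original answers, assuming each observation has exactly one YEAR dummy and one REGION dummy set to 1) is to collapse the dummies back into the underlying factors, build the combination label per row, and then spread DATA across one column per combination:
year   <- ifelse(DF$YEAR1 == 1, "YEAR1", "YEAR2")
region <- ifelse(DF$REGION1 == 1, "REGION1", "REGION2")
combos <- as.vector(outer(c("YEAR1", "YEAR2"), c("REGION1", "REGION2"), paste, sep = "_"))
combo  <- paste(year, region, sep = "_")
res <- as.data.frame(sapply(combos, function(cc) ifelse(combo == cc, as.character(DF$DATA), "0")))
rownames(res) <- rownames(DF)
res  # should match the expected result above
For the full 20*8*6 case, the combination labels could be generated with expand.grid() over the level sets instead of a single outer() call.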
