I have a dataframe with three factors of which two are binary and the third one is integer:
DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0
etc.
Now I want to transform it into something like this:
YEAR1_REGION1 YEAR1_REGION2 YEAR2_REGION1 YEAR2_REGION2
OBS1 X 0 0 0
OBS2 0 Y 0 0
OBS3 0 0 Z 0
Basic matrix multiplication is not what I'm after. I would like to find a neat way to do this that would automatically have the columns renamed as well. My actual data has three factor dimensions with 20*8*6 observations so finally there will be 960 columns altogether.
Here's another approach based on outer, similar to @Roland's answer.
year <- grep("YEAR", names(DF), value = TRUE)
region <- grep("REGION", names(DF), value = TRUE)
data <- as.character(DF$DATA)
df <- outer(year, region, function(x, y) DF[,x] * DF[,y])
colnames(df) <- outer(year, region, paste, sep = "_")
df <- as.data.frame(df)
for (i in seq_len(ncol(df)))
df[as.logical(df[,i]), i] <- data[as.logical(df[,i])]
df
## YEAR1_REGION1 YEAR2_REGION1 YEAR1_REGION2 YEAR2_REGION2
## OBS1 X 0 0 0
## OBS2 0 0 Y 0
## OBS3 0 Z 0 0
Maybe others will come up with a more succinct possibility, but this creates the expected result:
DF <- read.table(text=" DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0", header=TRUE)
DF[,-1] <- lapply(DF[,-1], as.logical)
DF[,1] <- as.character(DF[,1])
res <- apply(expand.grid(2:3, 4:5), 1, function(i) {
tmp <- rep("0", length(DF[,1]))
ind <- do.call(`&`,DF[,i])
tmp[ind] <- DF[ind,1]
tmp <- list(tmp)
names(tmp) <- paste0(names(DF)[i], collapse="_")
tmp
})
res <- as.data.frame(res)
rownames(res) <- rownames(DF)
# YEAR1_REGION1 YEAR2_REGION1 YEAR1_REGION2 YEAR2_REGION2
# OBS1 X 0 0 0
# OBS2 0 0 Y 0
# OBS3 0 Z 0 0
However, I suspect there is a much better possibility to achieve what you actually want to do, without creating a huge wide-format data.frame.
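A minimal sketch of what such a long-format alternative might look like, assuming every row has exactly one YEAR* and one REGION* indicator set. The names long, YEAR, REGION and CELL are made up for illustration and are not part of the original data:

```r
DF <- read.table(text = "DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0", header = TRUE)

year_cols   <- grep("^YEAR", names(DF), value = TRUE)
region_cols <- grep("^REGION", names(DF), value = TRUE)

# max.col() gives, for each row, the position of the largest entry,
# i.e. the indicator column that is set to 1 (assumption: exactly one per group)
long <- data.frame(
  DATA   = as.character(DF$DATA),
  YEAR   = year_cols[max.col(as.matrix(DF[year_cols]), ties.method = "first")],
  REGION = region_cols[max.col(as.matrix(DF[region_cols]), ties.method = "first")],
  row.names = rownames(DF)
)

# the label that would have been the wide column name
long$CELL <- paste(long$YEAR, long$REGION, sep = "_")
long
```

This keeps one row per observation and four columns regardless of how many year/region combinations exist, instead of 960 mostly empty columns.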
I have a data frame with several binary variables: x1, x2, ... x100. I want to replace the entry 1 in each column with the number in the name of the column, i.e.:
data$x2[data$x2 == 1] <- 2
data$x3[data$x3 == 1] <- 3
data$x4[data$x4 == 1] <- 4
data$x5[data$x5 == 1] <- 5
...
How can I achieve this in a loop?
Using col:
# example data
set.seed(1); d <- as.data.frame(matrix(sample(0:1, 12, replace = TRUE), nrow = 3))
names(d) <- paste0("x", seq(ncol(d)))
d
# x1 x2 x3 x4
# 1 0 0 0 1
# 2 1 1 0 0
# 3 0 0 1 0
ix <- d == 1
d[ ix ] <- col(d)[ ix ]
d
# x1 x2 x3 x4
# 1 0 0 0 4
# 2 1 2 0 0
# 3 0 0 3 0
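Note that col() returns the column position, which equals the number in the name only when the columns are x1, x2, ... in order with none missing. A small sketch for the other case, parsing the number out of the name instead; the data frame d2 and its columns x2, x5, x10 are made up for illustration:

```r
d2 <- data.frame(x2 = c(0, 1, 1), x5 = c(1, 0, 1), x10 = c(0, 0, 1))

# number taken from the column name, not from the column position
num <- as.integer(sub("^x", "", names(d2)))

# multiply every column by its own number; the 0/1 entries become 0/num
d2[] <- sweep(as.matrix(d2), 2, num, `*`)
d2
```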
dplyr approach (using zx8754's data):
library(dplyr)
d %>%
mutate(across(starts_with('x'), ~ . * as.numeric(gsub('x', '', cur_column()))))
#> x1 x2 x3 x4
#> 1 0 0 0 4
#> 2 1 2 0 0
#> 3 0 0 3 0
Created on 2021-05-26 by the reprex package (v2.0.0)
Here is a base R solution with an lapply loop.
data[-1] <- lapply(names(data)[-1], function(k){
n <- as.integer(sub("[^[:digit:]]*", "", k))
data[data[[k]] == 1, k] <- n
data[[k]]
})
data
Test data.
set.seed(2021)
data <- replicate(6, rbinom(10, 1, 0.5))
data <- as.data.frame(data)
names(data) <- paste0("x", 1:6)
A solution based on a simple for loop is below (otherwise similar to the accepted answer using lapply):
for (i in 2:100) {
k <- paste0('x', i)
data[data[[k]] == 1, k] <- i
}
I would like to write a loop or apply in R to overwrite only certain values of a variable based on a condition. Here is an example data frame:
df <- data.frame(
state = c("MA","CO","TX"),
random_numeric = c(26,28,4),
var1 = c(3,0,0),
var2 = c(3,1,5),
var3 = c(0,1,0),
prelim_row_sum = c(6,2,5)
)
df
state random_numeric var1 var2 var3 prelim_row_sum
1 MA 26 3 3 0 6
2 CO 28 0 1 1 2
3 TX 4 0 5 0 5
In df, I would like to replace only the first value in var1, var2, or var3 with zero if it equals half of prelim_row_sum. Thus, a correct loop or apply would replace only the first 3 (in row 1) and the first 1 (in row 2) with zero. I have included the random_numeric and state variables in the example data frame to show that there are other character and numeric variables in my larger data frame. Accordingly, a dplyr solution with across would not work for me. I could, of course, do this one-by-one:
df[1,3] <- 0
df[2,4] <- 0
df$final_row_sum = rowSums(df[3:5])
df
state random_numeric var1 var2 var3 prelim_row_sum final_row_sum
1 MA 26 0 3 0 6 3
2 CO 28 0 0 1 2 1
3 TX 4 0 5 0 5 5
But I would really appreciate help with a loop, apply, or a function, so that I can do this on larger, non-stylized data frames. Thank you!
Here is one way to do this with apply. I allowed for a little bit of generality in that you must input which columns to apply this function to. At the end there is a wrapper function so you can set those values to zero and then create the final_row_sum column.
df <- data.frame(
state = c("MA","CO","TX"),
random_numeric = c(26,28,4),
var1 = c(3,0,0),
var2 = c(3,1,5),
var3 = c(0,1,0),
prelim_row_sum = c(6,2,5)
)
my_func <- function(x){
value_to_zero <- which(
x[1:(length(x)-1)] == (x[length(x)]/2)
)
if(length(value_to_zero) > 0){
x[value_to_zero[1]] <- 0
}
return(x)
}
new_df <- df
cols_to_fix <- c("var1", "var2", "var3", "prelim_row_sum")
new_df[,cols_to_fix] <- t(
apply(
new_df[,cols_to_fix],
1,
my_func
)
)
new_df$final_row_sum <- rowSums(new_df[,cols_to_fix[-length(cols_to_fix)]])
new_df
state random_numeric var1 var2 var3 prelim_row_sum final_row_sum
1 MA 26 0 3 0 6 3
2 CO 28 0 0 1 2 1
3 TX 4 0 5 0 5 5
all_in_one <- function(x, cols){
my_func <- function(x){
value_to_zero <- which(
x[1:(length(x)-1)] == (x[length(x)]/2)
)
if(length(value_to_zero) > 0){
x[value_to_zero[1]] <- 0
}
return(x)
}
x[,cols] <- t(
apply(
x[,cols],
1,
my_func
)
)
x$final_row_sum <- rowSums(x[,cols[-length(cols)]])
return(x)
}
answer <- all_in_one(df, c("var1", "var2", "var3", "prelim_row_sum"))
state random_numeric var1 var2 var3 prelim_row_sum final_row_sum
1 MA 26 0 3 0 6 3
2 CO 28 0 0 1 2 1
3 TX 4 0 5 0 5 5
I have a more function-based, tidyverse answer. It will return a data frame of var1, var2, and var3. You can easily combine that with the original data frame. I like Tsai's response, but I think this is a bit easier to understand and is more flexible.
library(tidyverse) # you really just need purrr
f <- function(var1, var2, var3, prelim_row_sum, ...) {
cols <- c(var1, var2, var3)
index <- which((cols * 2) == prelim_row_sum)[1]
assign(paste0("var", index), 0)
data.frame(var1=var1, var2=var2, var3=var3)
}
pmap_dfr(df, f)
Try this
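cbind(df[1:2], t(apply(df[-(1:2)], 1, function(x){
i <- which(x[1:3] == x[4]/2)[1] # first var equal to half of prelim_row_sum, NA if none
if (!is.na(i)) x[i] <- 0 # which.max() would pick column 1 when nothing matches
c(x, final_row_sum = sum(x[-4]))
})))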
# state random_numeric var1 var2 var3 prelim_row_sum final_row_sum
# 1 MA 26 0 3 0 6 3
# 2 CO 28 0 0 1 2 1
# 3 TX 4 0 5 0 5 5
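For larger data frames, the same rule can also be applied without a row-wise apply. A rough sketch using matrix indexing, assuming the var columns are numeric; the helper names vars, m, hit, first and rows are only for illustration:

```r
df <- data.frame(
  state = c("MA", "CO", "TX"),
  random_numeric = c(26, 28, 4),
  var1 = c(3, 0, 0),
  var2 = c(3, 1, 5),
  var3 = c(0, 1, 0),
  prelim_row_sum = c(6, 2, 5)
)

vars <- c("var1", "var2", "var3")
m <- as.matrix(df[vars])

# TRUE where a value equals half of its row's prelim_row_sum
hit <- m == df$prelim_row_sum / 2

# column of the first TRUE in each row (meaningless for rows without any TRUE)
first <- max.col(hit * 1, ties.method = "first")

# only touch rows that actually contain a match
rows <- which(rowSums(hit) > 0)
m[cbind(rows, first[rows])] <- 0

df[vars] <- m
df$final_row_sum <- rowSums(m)
df
```

Only the columns listed in vars are read or written, so other character and numeric columns are left alone.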
I want to create a matrix and assign 1s based on matching the rownames of the matrix to the character vector.
## Here is the small example matrix
x <- as.character(c("rm78", "mn05", "hg78"))
y <- as.character(c("JU67", "EX56", "abcd", "rm78", "xyh56", "def", "terr6572"))
z <- as.character(c("abcd", "rh990", "mn05", "rm78", "xyh56", "efdg", "bett72"))
common <- Reduce(union, list(x,y,z))
dat.names <- c("x", "y", "z")
mat0 <- matrix(0, nrow = length(common), ncol = length(dat.names))
colnames(mat0) <- dat.names
rownames(mat0) <- common
mat0
If an element of the character vectors x, y, or z matches a rowname of the matrix mat0, then assign 1 to the corresponding cell in the matrix.
I am doing this individually for each vector and adding values to the matrix. I have a list of more than 12 such vectors, and doing it this way would be redundant. I think there may be a much more efficient way of doing this.
for(i in rownames(mat0)[rownames(mat0) %in% x])
{
# first column
mat0[i , 1] <- 1
}
for(i in rownames(mat0)[rownames(mat0) %in% y])
{
# second column
mat0[i , 2] <- 1
}
for(i in rownames(mat0)[rownames(mat0) %in% z])
{
# third column
mat0[i , 3] <- 1
}
Yes, you don't need multiple loops. In fact, you don't need any:
mat0[] <- do.call(cbind, lapply(list(x, y, z), function(i) +(rownames(mat0) %in% i)))
mat0
#> x y z
#> rm78 1 1 1
#> mn05 1 0 1
#> hg78 1 0 0
#> JU67 0 1 0
#> EX56 0 1 0
#> abcd 0 1 1
#> xyh56 0 1 1
#> def 0 1 0
#> terr6572 0 1 0
#> rh990 0 0 1
#> efdg 0 0 1
#> bett72 0 0 1
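With a dozen or more vectors it may be easier to keep them in a named list, so that the column names come along for free. A sketch along the same lines; sets, ids and mat are illustrative names:

```r
sets <- list(
  x = c("rm78", "mn05", "hg78"),
  y = c("JU67", "EX56", "abcd", "rm78", "xyh56", "def", "terr6572"),
  z = c("abcd", "rh990", "mn05", "rm78", "xyh56", "efdg", "bett72")
)

# all distinct identifiers, in order of first appearance
ids <- Reduce(union, sets)

# one logical column per list element; unary + turns TRUE/FALSE into 1/0
mat <- +sapply(sets, function(s) ids %in% s)
rownames(mat) <- ids
mat
```

Adding another vector then only means adding another element to sets.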
I have a data set that looks as follows
df <- data.frame( name = c("a", "b", "c"),
judgement1= c(5, 0, NA),
judgement2= c(1, 1, NA),
judgement3= c(2, 1, NA))
I want to reshape the dataframe to look like this
# name judgement1 judgement2 judgement3
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# b 1 0 0
# b 0 1 0
# b 0 0 1
And so on. I have seen that untable is recommended on some other threads, but it does not appear to work with the current version of R. Is there a package that can convert summarised counts into individual observations?
You could try something like this:
df <- data.frame( name = c("a", "b", "c"),
judgement1= c(5, 0, NA),
judgement2= c(1, 1, NA),
judgement3= c(2, 1, NA))
rep.vec <- colSums(df[colnames(df) %in% paste0("judgement", (1:nrow(df)), sep="")], na.rm = TRUE)
want <- data.frame(name=df$name, cbind(diag(nrow(df))))
colnames(want)[-1] <- paste0("judgement", (1:nrow(df)), sep="")
(want <- want[rep(1:nrow(want), rep.vec), ])
I wrote a function that works to give you your desired output:
untabl <- function(df, id.col, count.cols) {
df[is.na(df)] <- 0 # replace NAs
out <- lapply(count.cols, function(x) { # for each column with counts
z <- df[rep(1:nrow(df), df[,x]), ] # replicate rows
z[, -c(id.col)] <- 0 # set all other columns to zero
z[, x] <- 1 # replace the count values with 1
z
})
out <- do.call(rbind, out) # combine the list
out <- out[order(out[,c(id.col)]),] # reorder (you can change this)
rownames(out) <- NULL # return to simple row numbers
out
}
untabl(df = df, id.col = 1, count.cols = c(2,3,4))
# name judgement1 judgement2 judgement3
#1 a 1 0 0
#2 a 1 0 0
#3 a 1 0 0
#4 a 1 0 0
#5 a 1 0 0
#6 a 0 1 0
#7 a 0 0 1
#8 a 0 0 1
#9 b 0 1 0
#10 b 0 0 1
And for your reference, reshape::untable consists of the following code:
function (df, num)
{
df[rep(1:nrow(df), num), ]
}
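If the goal is one row per counted observation, the expansion can also be done in one step with rep() on the row and column indices of the count matrix. A sketch, treating NA counts as zero as in the function above; counts, cell_row, cell_col, ind and out are illustrative names:

```r
df <- data.frame(name = c("a", "b", "c"),
                 judgement1 = c(5, 0, NA),
                 judgement2 = c(1, 1, NA),
                 judgement3 = c(2, 1, NA))

counts <- as.matrix(df[-1])
counts[is.na(counts)] <- 0

# one entry per observation: which row (name) and which column (judgement) it came from
cell_row <- rep(as.vector(row(counts)), times = as.vector(counts))
cell_col <- rep(as.vector(col(counts)), times = as.vector(counts))

# turn the column index into 0/1 indicator columns
ind <- diag(ncol(counts))[cell_col, , drop = FALSE]
colnames(ind) <- colnames(counts)

out <- data.frame(name = df$name[cell_row], ind)
out <- out[order(cell_row, cell_col), ]  # group by name, then by judgement
rownames(out) <- NULL
out
```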
I'm trying to create a random data set in R that has metric, binomial and categorical variables. However, in the end when I check the class of my categorical variables R says they are numeric, but I need them to be factors for my further analysis. Does anybody have an idea what I'm doing wrong here?
Here is my code:
set.seed(3456)
R.dat <- function(n = 5000,metr=1,bin=1,cat=3) {
j <- metr
X <- (matrix(0,n,j))
for (i in 1:n) {
X[i,] <- rnorm(j, mean = 0, sd = 1)
}
BIN <- matrix(0,n,bin)
for (i in 1:bin) {
BIN[,i] <- rbinom(n,1, 0.5)
}
CAT <- matrix(0,n,cat)
for (i in 1:cat) {
CAT[,i] <- factor(sample(1:4, n, TRUE))
}
X <- as.data.frame(cbind(X,BIN, CAT))
return(X)
}
Dat <- R.dat(n=5000,metr=1,bin=1, cat=3)
summary(Dat)
If I just sample like this:
x <- factor(sample(1:4, n, TRUE))
class(x)
it says x is a factor, so I don't get why it doesn't do the same when I use it in the function and loop. Any help is much appreciated, thanks in advance!
When you do this:
CAT <- matrix(0,n,cat)
for (i in 1:cat) {
CAT[,i] <- factor(sample(1:4, n, TRUE))
}
you create a numeric matrix CAT, and then you assign a new value to a subset of that matrix. When you do that assignment, the new value is coerced to the type of CAT, which is numeric.
Also, when you cbind the matrices X, BIN and CAT at the end, you coerce all of them to a common type. This would again mess up your variable types, even assuming everything was working correctly up to this point.
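Both effects can be seen on a tiny example. This is only an illustrative sketch; the objects f, m, cb and d are made up here and are not taken from the question's code:

```r
f <- factor(c("b", "a", "c", "a"))

# assigning a factor into a numeric matrix keeps only its integer codes
m <- matrix(0, nrow = 4, ncol = 1)
m[, 1] <- f
is.factor(m[, 1])   # FALSE
typeof(m)           # "double"

# cbind() on plain vectors builds a matrix, so the factor is again reduced to codes
cb <- cbind(x = c(0.5, 1.5, 2.5, 3.5), f = f)
is.numeric(cb)      # TRUE

# a data frame stores each column separately and keeps its class
d <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), f = f)
sapply(d, class)
```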
The rest of your code can also be simplified considerably. In particular, you don't need looping to reassign values to matrices; you can call the matrix constructor function directly on a vector of values.
Try this instead:
R.dat <- function(n=5000, metr=1, bin=1, cat=3)
{
X <- matrix(rnorm(n * metr), nrow=n)
B <- matrix(rbinom(n * bin, 1, 0.5), nrow=n)
F <- matrix(as.character(sample(1:4, n * cat, TRUE)), nrow=n)
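# stringsAsFactors=TRUE is needed in R >= 4.0, where character columns are no longer converted to factors by default
data.frame(X=X, B=B, F=F, stringsAsFactors=TRUE)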
}
You don't need a loop. If you switch to data.table, you can generate the columns by reference.
library(data.table)
n <- 10
bin <- 1
DT <- data.table(X=replicate(n, rnorm(bin, mean=0, sd = 1)),
BIN = rbinom(n,1, 0.5),
CAT = factor(sample(1:4, n, TRUE)))
## If you need you can add more columns
cols <- paste0("BIN", 1:3) # binomial columns
DT[, (cols):= lapply(rep(n, 3) ,rbinom, 1, .5) ]
cols <- paste0("CAT", 1:3) # categorical (factor) columns
DT[, (cols):= lapply(rep(n, 3) ,function(x){factor(sample(1:4, n, TRUE)) }) ]
DT
lapply(DT, class)
DT
X BIN CAT BIN1 BIN2 BIN3 CAT1 CAT2 CAT3
1: 1.2934720 1 2 0 0 0 1 1 2
2: -0.1183180 1 2 0 0 1 3 3 1
3: 0.3648810 1 2 1 1 1 3 2 3
4: -0.2149963 1 2 1 1 0 2 3 2
5: 0.3204577 1 1 0 1 1 2 2 4
6: -0.5941640 0 4 1 0 0 2 3 1
7: -1.8852835 1 4 1 0 0 2 1 1
8: -0.8329852 0 2 0 0 1 1 1 2
9: -0.1353628 0 4 0 1 1 1 4 1
10: -0.2943969 1 4 0 1 0 4 3 3
> lapply(DT, class)
$X
[1] "numeric"
$BIN
[1] "integer"
$CAT
[1] "factor"
$BIN1
[1] "integer"
$BIN2
[1] "integer"
$BIN3
[1] "integer"
$CAT1
[1] "factor"
$CAT2
[1] "factor"
$CAT3
[1] "factor"
Because a matrix cannot hold a factor vector, the factor is coerced to its underlying integer codes.
Just change it into a data frame:
CAT <- matrix(0,n,cat)
CAT <- as.data.frame(CAT)
This will do the trick.