I'm relatively new to R and am struggling with what is probably a very simple problem.
I have data with multiple columns named in a similar way. Here is some sample data:
df = data.frame(PPID = 1:50,
time1 = sample(c(0,1), 50, replace = TRUE),
time2 = sample(c(0,1), 50, replace = TRUE),
time3 = sample(c(0,1), 50, replace = TRUE),
condition1 = sample(c(0:3), 50, replace = TRUE),
condition2 = sample(c(0:3), 50, replace = TRUE))
In my actual data, I have many more columns: approximately 50 for time and 10 for condition.
I want to multiply each time column by each condition column; for the sample data this should give me 6 extra columns: time1_condition1, time1_condition2, time2_condition1, time2_condition2, time3_condition1, time3_condition2.
I tried the solutions suggested in this thread, but they did not work (presumably because I didn't understand how mapply/apply work and did not make the appropriate changes): I got an error message saying that the longer argument is not a multiple of the length of the shorter.
Any help would be greatly appreciated!
#Get all the columns with "time" columns
time_cols <- grep("^time", names(df))
#Get all the columns with "condition" column
condition_cols <- grep("^condition", names(df))
#Multiply each "time" columns with all the condition columns
# and creating a new dataframe
new_df <- do.call("cbind", lapply(df[time_cols] , function(x) x *
df[condition_cols]))
#Combine both the dataframes
complete_df <- cbind(df,new_df)
We can also generate column names using expand.grid
new_names <- do.call("paste0",
expand.grid(names(df)[condition_cols], names(df)[time_cols]))
colnames(complete_df)[7:12] <- new_names
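If the new columns should carry the time1_condition1 style of name requested in the question, and the hard-coded 7:12 is to be avoided, the same idea can be sketched in base R as below. This is only an illustration: the fixed seed, the helper names (pairs, products) and the use of Map are assumptions, not part of the answer above.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(PPID = 1:50,
                 time1 = sample(c(0, 1), 50, replace = TRUE),
                 time2 = sample(c(0, 1), 50, replace = TRUE),
                 time3 = sample(c(0, 1), 50, replace = TRUE),
                 condition1 = sample(0:3, 50, replace = TRUE),
                 condition2 = sample(0:3, 50, replace = TRUE))

time_names <- grep("^time", names(df), value = TRUE)
cond_names <- grep("^condition", names(df), value = TRUE)

# One row per (time, condition) pair; the condition name varies fastest
pairs <- expand.grid(cond = cond_names, time = time_names,
                     stringsAsFactors = FALSE)

# Multiply the two columns of each pair element-wise
products <- Map(function(tm, cn) df[[tm]] * df[[cn]], pairs$time, pairs$cond)
names(products) <- paste(pairs$time, pairs$cond, sep = "_")

complete_df <- cbind(df, as.data.frame(products))
names(complete_df)
```

Because the names are built from the same pairs that drive the multiplication, they cannot drift out of sync with the column order when the number of time or condition columns changes.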
Here is a tidyverse alternative
library(tidyverse)
idx.time <- grep("time", names(df), value = TRUE)
idx.cond <- grep("condition", names(df), value = TRUE)
bind_cols(
  df,
  map_dfc(transpose(expand.grid(idx.time, idx.cond, stringsAsFactors = FALSE)),
          ~setNames(data.frame(df[, .x$Var1] * df[, .x$Var2]), paste(.x$Var1, .x$Var2, sep = "_"))))
# PPID time1 time2 time3 condition1 condition2 time1_condition1
#1 1 1 0 1 3 0 3
#2 2 0 1 1 0 1 0
#3 3 0 1 1 0 2 0
#4 4 0 0 1 0 3 0
#5 5 0 0 0 0 3 0
#...
Explanation: expand.grid creates all pairwise combinations of idx.time and idx.cond. transpose turns a list/data.frame inside-out and returns a list, similar to apply(..., 1, as.list); map_dfc then operates on every element of that list and column-binds results.
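To make the intermediate structure concrete, here is a small base R sketch of what the list of pairs looks like. It imitates the effect of transpose() with lapply rather than calling purrr, and the column-name vectors are typed in by hand, so treat it as an illustration only.

```r
idx.time <- c("time1", "time2", "time3")
idx.cond <- c("condition1", "condition2")

# All pairwise combinations; Var1 (the time name) varies fastest
grid <- expand.grid(idx.time, idx.cond, stringsAsFactors = FALSE)

# Base R stand-in for transpose(): one list element per row of the grid
pairs <- lapply(seq_len(nrow(grid)), function(i) {
  list(Var1 = grid$Var1[i], Var2 = grid$Var2[i])
})

length(pairs)    # number of (time, condition) pairs
str(pairs[[1]])  # the first pair, with elements Var1 and Var2
```

Each element of pairs is what the formula in map_dfc receives as .x, which is why the answer can refer to .x$Var1 and .x$Var2.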
Using
library(tidyverse)
a = df[grep("time",names(df))]
b = df[grep("condition",names(df))]
we can do:
map(a,~.x*b)%>%
bind_cols()%>%
set_names(paste(rep(names(a),each=ncol(b)),names(b),sep="_"))
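or we can
# cross2() varies its first argument fastest, so the time name must cycle fastest in the column names
# (note: cross2() and lift() are deprecated as of purrr 1.0.0)
cross2(a,b)%>%
map(lift(`*`))%>%
set_names(paste(names(a),rep(names(b),each=ncol(a)),sep="_"))%>%
data.frame()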
time1_condition1 time2_condition1 time3_condition1 time1_condition2 time2_condition2 time3_condition2
1 3 0 3 2 0 2
2 3 3 0 1 1 0
3 0 0 0 0 0 0
4 3 3 0 0 0 0
5 0 0 2 0 0 1
6 0 0 1 0 0 1
7 2 2 0 0 0 0
Does anyone have an idea how to generate a column in which exactly one randomly chosen row is marked with the number "1"? All other rows should be "0".
I need a function for this in R.
Here is what I need in photos:
df <- data.frame(subject = 1, choice = 0, price75 = c(0,0,0,1,1,1,0,1))
This command will update the choice column to contain a single random row with value of 1 each time it is called. All other rows values in the choice column are set to 0.
df$choice <- +(seq_along(df$choice) == sample(nrow(df), 1))
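Since the question asks for a function, the one-liner can be wrapped in a small helper. The name mark_one_random is made up for this sketch, and the seed is only there to make the example repeatable.

```r
# Hypothetical helper (not from the original answer): returns an integer
# vector of length n holding a single 1 at a random position and 0 elsewhere.
mark_one_random <- function(n) {
  stopifnot(n >= 1)
  as.integer(seq_len(n) == sample.int(n, 1))
}

set.seed(42)  # assumption: fixed seed for repeatability
df <- data.frame(subject = 1, choice = 0, price75 = c(0, 0, 0, 1, 1, 1, 0, 1))
df$choice <- mark_one_random(nrow(df))
df
```

Calling the helper again draws a new position, so the marked row changes from call to call unless the seed is reset.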
With integer(nrow(DF)) a vector of zeros is created, in which `[<-` places a 1 at the position drawn by sample(nrow(DF), 1L).
DF <- data.frame(subject=1, choice="", price75=c(0,0,0,1,1,1,0,1))
DF$choice <- `[<-`(integer(nrow(DF)), sample(nrow(DF), 1L), 1L)
DF
# subject choice price75
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 1 1
#5 1 0 1
#6 1 0 1
#7 1 0 0
#8 1 0 1
> x <- rep(0, 10)
> x[sample(1:10, 1)] <- 1
> x
[1] 0 0 0 0 0 0 0 1 0 0
There are many ways to set a random value in a row/column in R:
df<-data.frame(x=rep(0,10)) #make dataframe df, with column x, filled with 10 zeros.
set.seed(2022) #set a random seed - this is for repeatability
#two base methods for sampling:
#sample.int(n=10, size=1) # sample an integer from 1 to 10, sample size of 1
#sample(x=1:10, size=1) # sample from 1 to 10, sample size of 1
df$x[sample.int(n=10, size=1)] <- 1 # randomly selecting one of the ten rows, and replacing the value with 1
df
I am wondering what an efficient approach to the following question would be:
Suppose I have three characters in group 1 and two characters in group 2:
group_1 = c("X", "Y", "Z")
group_2 = c("A", "B")
All possible combinations for group_1 and group_2 are given by:
group_1_combs = data.frame(X = c(0,1,0,0,1,1,0,1),
Y = c(0,0,1,0,1,0,1,1),
Z = c(0,0,0,1,0,1,1,1))
group_2_combs = data.frame(A = c(0,1,0,1),
B = c(0,0,1,1))
My question is the following:
(1) How do I go from group_1 to group_1_combs efficiently (given that the character vector might be large).
(2) How do I do an "all possible" combinations of each row of group_1_combs and group_2_combs? Specifically, I want a "final" data.frame where each row of group_1_combs is "permuted" with every row of group_2_combs. This means that the final data.frame would have 8 x 4 rows (since there are 8 rows in group_1_combs and 4 rows in group_2_combs) and 5 columns (X,Y,Z,A,B).
Thanks!
You want expand.grid and merge:
Question 1:
group_1_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_1)), group_1))
group_2_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_2)), group_2))
Question 2:
> merge(group_1_combs, group_2_combs)
X Y Z A B
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
...
Or you can go directly to the merged data.frame:
group_12 <- c(group_1, group_2)
expand.grid(setNames(rep(list(c(0, 1)), length(group_12)), group_12))
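As a quick sanity check, the sketch below confirms that the two routes produce the same 32 rows and 5 columns. The helpers binary_grid and row_key are hypothetical names introduced for this illustration. Keep in mind that the grid has 2^n rows for n characters, so it grows very quickly for long character vectors.

```r
group_1 <- c("X", "Y", "Z")
group_2 <- c("A", "B")

# Hypothetical helper: every 0/1 combination for a set of labels
binary_grid <- function(labels) {
  expand.grid(setNames(rep(list(c(0, 1)), length(labels)), labels))
}

# Route 1: build each grid, then take the cross join via merge()
# (with no shared column names, merge() returns the Cartesian product)
merged <- merge(binary_grid(group_1), binary_grid(group_2))

# Route 2: build the combined grid in one call
direct <- binary_grid(c(group_1, group_2))

# Compare the sets of rows irrespective of row order
row_key <- function(d) sort(do.call(paste0, unname(as.list(d))))
same_rows <- identical(row_key(merged), row_key(direct))

dim(merged)
same_rows
```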
I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sake of simplicity, let's say there are only four binary variables and 6 participants.
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
>df
# a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0
In the dataframe, I want to create a list of columns that reflect each unique combination of variables in df (i.e., abc, abd, bcd, cda). Then, for each row, I want to add value "1" if the row contains the particular combination corresponding to the column. So, if the participant scored 1 on "a", "b", and "c", and 0 on "d" he would have a score 1 in the newly created column "abc", but 0 in the other columns. Ideally, it would look something like this.
>df_updated
# a b c d abc abd bcd cda
# 1 0 1 0 1 0 0 0 0
# 2 1 1 0 1 0 1 0 0
# 3 0 1 0 0 0 0 0 0
# 4 1 1 1 0 1 0 0 0
# 5 0 0 1 0 0 0 0 0
# 6 1 1 1 0 1 0 0 0
The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.
Something like this?
funCombn <- function(data){
  f <- function(x, data){
    data <- data[x]
    list(
      name = paste(x, collapse = ""),
      vec = apply(data, 1, function(x) +all(as.logical(x)))
    )
  }
  # Use the `data` argument here rather than the global `df`
  res <- combn(names(data), 3, f, simplify = FALSE, data = data)
  out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
  names(out) <- sapply(res, '[[', 'name')
  cbind(data, out)
}
funCombn(df)
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Base R option using combn :
n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)),
paste0(x, collapse = ''))
}, simplify = FALSE))) -> result
result
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Using combn, create all combinations of the column names, taking n columns at a time. For each of those combinations, assign 1 to the rows where all n columns in the combination are 1, and 0 otherwise.
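To get from these indicator columns to the frequency ranking the question ultimately asks for, the column sums can be sorted. The sketch below rebuilds the indicators as a matrix for brevity; the names combos, ind and freq are introduced here for illustration.

```r
df <- data.frame(a = c(0, 1, 0, 1, 0, 1),
                 b = c(1, 1, 1, 1, 0, 1),
                 c = c(0, 0, 0, 1, 1, 1),
                 d = c(1, 1, 0, 0, 0, 0))
n <- 3

# All combinations of n column names
combos <- combn(names(df), n, simplify = FALSE)

# One indicator column per combination: 1 if all n columns are 1 in that row
ind <- sapply(combos, function(cols) as.integer(rowSums(df[cols] == 1) == n))
colnames(ind) <- sapply(combos, paste, collapse = "")

# Frequency of each combination, most frequent first
freq <- sort(colSums(ind), decreasing = TRUE)
freq
```

With the sample data, abc is the most frequent combination (rows 4 and 6), followed by abd (row 2).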
If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x)ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
# abd b bd c abc
# 1 1 1 1 2
Ok, so let's use your data, including one row without any 1's:
df <- data.frame(
a = c(0,1,0,1,0,1,0),
b = c(1,1,1,1,0,1,0),
c = c(0,0,0,1,1,1,0),
d = c(1,1,0,0,0,0,0)
)
Now I want to paste all column names together if they have a 1, and then make that a wide table (so that all have a column for a combination). Of course, I fill all resulting NAs with 0's.
library(dplyr) # provides the %>% pipe used below
df2 <- df %>%
dplyr::mutate(
combination = paste0(
ifelse(a == 1, "a", ""), # There is possibly a way to automate this as well using across()
ifelse(b == 1, "b", ""),
ifelse(c == 1, "c", ""),
ifelse(d == 1, "d", "")
),
combination = ifelse(
combination == "",
"nothing",
paste0("comb_", combination)
),
value = ifelse(
is.na(combination),
0,
1
),
i = dplyr::row_number()
) %>%
tidyr::pivot_wider(
names_from = combination,
values_from = value,
names_repair = "unique"
) %>%
replace(., is.na(.), 0) %>%
dplyr::select(-i)
Since you want to order the original df by frequency, you can create a summary of all combinations (excluding those without anything filled in). Then you just make it a long table and pull the column for every combination (arranged by frequency) from the table.
comb_in_order <- df2 %>%
dplyr::select(
-tidyselect::any_of(
c(
names(df),
"nothing" # I think you want these last.
)
)
) %>%
dplyr::summarise(
dplyr::across(
.cols = tidyselect::everything(),
.fns = sum
)
) %>%
tidyr::pivot_longer(
cols = tidyselect::everything(),
names_to = "combination",
values_to = "frequency"
) %>%
dplyr::arrange(
dplyr::desc(frequency)
) %>%
dplyr::pull(combination)
The only thing to do then is to reconstruct the original df by these after arranging by the columns.
df2 %>%
dplyr::arrange(
dplyr::across(
tidyselect::any_of(comb_in_order),
dplyr::desc
)
) %>%
dplyr::select(
tidyselect::any_of(names(df))
)
This should work for all possible combinations.
I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1 :
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3 :
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign multiple columns to '0' based on the value of 'X'
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
Also, we can use add_column from tibble and create columns at a specific location
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)),
letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
cbind.data.frame(matrix(
0,
ncol = n,
nrow = nrow(x),
dimnames = list(NULL, letters[1:n])
), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26 the function would need some adjustment regarding the naming scheme if n > 26.
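One way to lift the 26-column limit mentioned in the note is to generate numbered names instead of using letters. The sketch below is only one possible adjustment; the function name add_zero_cols and the "zero" prefix are assumptions, and it does not guard against a clash with existing column names.

```r
# Hypothetical variant of f() with numbered column names (zero1, zero2, ...)
add_zero_cols <- function(x, n, prefix = "zero") {
  stopifnot(is.data.frame(x), n >= 0)
  if (n == 0) return(x)
  zeros <- as.data.frame(matrix(
    0,
    nrow = nrow(x),
    ncol = n,
    dimnames = list(NULL, paste0(prefix, seq_len(n)))
  ))
  cbind(zeros, x)  # new columns first, original columns last
}

dt <- data.frame(x = 1:3, y = 4:6)
wide <- add_zero_cols(dt, 30)
dim(wide)
```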
You can try the code below
dt <- cbind(`colnames<-`(t(rep(0,X)),letters[seq(X)]),dt)
If you don't care about the names of the added columns, you can simply use
dt <- cbind(t(rep(0,X)),dt)
which is much shorter.
After merging a dataframe with another, I'm left with random NAs in the occasional row. I'd like to set these NAs to 0 so I can perform calculations with them.
I'm trying to do this with:
bothbeams.data = within(bothbeams.data, {
bothbeams.data$x.x = ifelse(is.na(bothbeams.data$x.x) == TRUE, 0, bothbeams.data$x.x)
bothbeams.data$x.y = ifelse(is.na(bothbeams.data$x.y) == TRUE, 0, bothbeams.data$x.y)
})
Where $x.x is one column and $x.y is the other of course, but this doesn't seem to work.
You can just use the output of is.na to replace directly with subsetting:
bothbeams.data[is.na(bothbeams.data)] <- 0
Or with a reproducible example:
dfr <- data.frame(x=c(1:3,NA),y=c(NA,4:6))
dfr[is.na(dfr)] <- 0
dfr
x y
1 1 0
2 2 4
3 3 5
4 0 6
However, be careful using this method on a data frame containing factors that also have missing values:
> d <- data.frame(x = c(NA,2,3),y = c("a",NA,"c"), stringsAsFactors = TRUE) # since R 4.0.0 strings are no longer factors by default
> d[is.na(d)] <- 0
Warning message:
In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
It "works":
> d
x y
1 0 a
2 2 <NA>
3 3 c
...but in this case you will likely want to alter only the numeric columns rather than the whole data frame. See, e.g., the answer below using dplyr::mutate_if.
A solution using mutate_all from dplyr in case you want to add that to your dplyr pipeline:
library(dplyr)
df %>%
mutate_all(~ ifelse(is.na(.), 0, .)) # formula shorthand; funs() has been deprecated since dplyr 0.8.0
Result:
A B C
1 0 0 0
2 1 0 0
3 2 0 2
4 3 0 5
5 0 0 2
6 0 0 1
7 1 0 1
8 2 0 5
9 3 0 2
10 0 0 4
11 0 0 3
12 1 0 5
13 2 0 5
14 3 0 0
15 0 0 1
If you only want to replace the NAs in numeric columns, which I assume might be the case in modeling, you can use mutate_if:
library(dplyr)
df %>%
mutate_if(is.numeric, ~ ifelse(is.na(.), 0, .)) # formula shorthand; funs() has been deprecated since dplyr 0.8.0
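or in base R (restricted to the numeric columns; note that this modifies df in place):
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) replace(x, is.na(x), 0))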
Result:
A B C
1 0 0 0
2 1 <NA> 0
3 2 0 2
4 3 <NA> 5
5 0 0 2
6 0 <NA> 1
7 1 0 1
8 2 <NA> 5
9 3 0 2
10 0 <NA> 4
11 0 0 3
12 1 <NA> 5
13 2 0 5
14 3 <NA> 0
15 0 0 1
Update
With dplyr 1.0.0, across() was introduced:
library(dplyr)
# Replace `NA` for all columns
df %>%
mutate(across(everything(), ~ ifelse(is.na(.), 0, .)))
# Replace `NA` for numeric columns
df %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), 0, .)))
Data:
set.seed(123)
df <- data.frame(A=rep(c(0:3, NA), 3),
B=rep(c("0", NA), length.out = 15),
C=sample(c(0:5, NA), 15, replace = TRUE))
You can use replace_na() from the tidyr package:
df %>% replace_na(list(column1 = 0, column2 = 0))
To add to James's example, it seems you always have to create an intermediate when performing calculations on NA-containing data frames.
For instance, adding two columns (A and B) together from a data frame dfr:
temp.df <- data.frame(dfr) # copy the original
temp.df[is.na(temp.df)] <- 0
dfr$C <- temp.df$A + temp.df$B # or any other calculation
remove('temp.df')
When I do this I throw away the intermediate afterwards with remove/rm.
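For the specific case of adding columns, an intermediate copy may not be needed: rowSums() with na.rm = TRUE treats each NA as if it were 0 and leaves the original columns untouched. A minimal sketch, assuming a data frame dfr with numeric columns A and B (the values here are made up):

```r
# Assumed example data; column names A and B follow the snippet above
dfr <- data.frame(A = c(1, NA, 3), B = c(NA, 5, 6))

# Sum A and B row by row, ignoring NAs (equivalent to treating them as 0)
dfr$C <- rowSums(dfr[c("A", "B")], na.rm = TRUE)
dfr
```

This only covers sums; other calculations may still call for replacing the NAs explicitly, as in the answers above.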
If you only want to replace NAs with 0s for a few select columns, you can also use an lapply solution, e.g.:
data = data.frame(
one = c(NA,0),
two = c(NA,NA),
three = c(1,2),
four = c("A",NA)
)
data[1:2] = lapply(data[1:2],function(x){
x[is.na(x)] = 0
return(x)
})
data
Why not try this
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
na.zero(df)