Goal: To merge multiple columns based only on the similarity of their column names.
Issue: I am dealing with a large data set where the column names are replicated and look like this: wk1.1, wk1.2, wk1.3. For each row, only one of the similarly named columns holds a value; the others are NA. coalesce() is very helpful, but it becomes tedious (and breaks automation) when I have to list each column name. Is there a way to coalesce based on a string of characters? For instance, in the example below I would prefer to coalesce every column whose name matches "wk1."
library(dplyr)
wk1.1 <- c(15, 4, 1)
wk1.2 <- c(3, 20, 4)
wk1.3 <- c(1, 2, 17)
df <- data.frame(wk1.1, wk1.2, wk1.3)
df[df < 14] <- NA
df1 <- df %>%
mutate(wk1 = coalesce(df$wk1.1, df$wk1.2, df$wk1.3))
We can splice the columns into coalesce() with !!!
library(dplyr)
df %>%
mutate(wk1 = coalesce(!!! .))
# wk1.1 wk1.2 wk1.3 wk1
#1 15 NA NA 15
#2 NA 20 NA 20
#3 NA NA 17 17
Another option is to reduce over the columns with coalesce
library(purrr)
df %>%
mutate(wk1 = reduce(., coalesce))
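Both snippets above coalesce every column in df. If the data frame also holds columns that should not take part, one way to restrict the call to a name prefix is to subset first and hand the pieces to coalesce() with do.call(). A minimal sketch, assuming the prefix is literally "wk1."; the column named other is made up for illustration:

```r
library(dplyr)

# Example data: one non-NA value per row among the wk1.* columns,
# plus an unrelated column ("other", hypothetical) that must be left out.
df <- data.frame(
  wk1.1 = c(15, NA, NA),
  wk1.2 = c(NA, 20, NA),
  wk1.3 = c(NA, NA, 17),
  other = c(1, 2, 3)
)

# Keep only the columns whose names start with the prefix.
wk1_cols <- df[startsWith(names(df), "wk1.")]

# Pass each selected column to coalesce() as a separate, unnamed argument.
df$wk1 <- do.call(coalesce, unname(as.list(wk1_cols)))

df$wk1
```

Because the selection is done on names(df), the same two lines can be wrapped in a loop or a small function over several prefixes (wk2., wk3., and so on).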
Related
I need to operate on columns based on a condition on their names. In the following reproducible example, for each column that ends with 'x', I create a new column that multiplies the respective variable by 2:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
#Create variables manually: For every column that ends with X, I need to create one column that multiplies the respective column by 2
df <- df %>%
mutate(a1_x_new = a1_x*2,
a2_x_new = a2_x*2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
across(ends_with("x"), ~ .x*2, .names = "{.col}_new")
)
Thanks @RicardoVillalba for the correction.
You could use transmute and across to generate the new columns for those column names ending in "x". Then, use rename_with to add the "_new" suffix and bind_cols back to the original data frame.
library(dplyr)
df <- df %>%
transmute(across(ends_with("x"), ~ . * 2)) %>%
rename_with(., ~ paste0(.x, "_new")) %>%
bind_cols(df, .)
Result:
head(df)
id a1_x a1_y a2_x a1_x_new a2_x_new
1 1 4.662952 0.4152313 8.706219 9.325905 17.412438
2 2 2.078233 1.4834044 3.317145 4.156466 6.634290
3 3 7.996580 1.4035441 4.834126 15.993159 9.668252
4 4 6.518713 7.0844794 8.457379 13.037426 16.914759
5 5 3.215092 3.5578827 8.196574 6.430184 16.393149
6 6 7.189275 5.2277208 3.712805 14.378550 7.425611
I'm trying to replace binary information in dataframe columns with strings that refer to the columns' names.
My data looks like this (just with more natXY columns and some additional variables):
df <- data.frame(id = c(1:5), natAB = c(1,0,0,0,1), natCD = c(0,1,0,0,0), natother = c(0,0,1,1,0), var1 = runif(5, 1, 10))
df
All column names in question start with "nat", mostly followed by two letters although some contain a different number of characters.
For a single column, the following code achieves the desired outcome:
df %>% mutate(natAB = ifelse(natAB == 1, "AB", NA)) -> df
Now I need to generalise this line in order to apply it to the other columns using the mutate() and across() functions.
I imagine something like this
df %>% mutate(across(natAB:natother, ~ ifelse(
. == 1, paste(substr(colnames(.), start = 4, stop = nchar(colnames(.)))), NA))) -> df
... but end up with all my "nat" columns filled with NA. How do I reference the column name correctly in this code structure?
Any help is much appreciated.
You can use cur_column to refer to the column name in an across call, and then use str_remove:
library(stringr)
library(dplyr)
df %>%
mutate(across(natAB:natother,
~ ifelse(.x == 1, str_remove(cur_column(), "nat"), NA)))
# id natAB natCD natother var1
# 1 1 AB <NA> <NA> 7.646891
# 2 2 <NA> CD <NA> 4.704543
# 3 3 <NA> <NA> other 7.717925
# 4 4 <NA> <NA> other 3.367320
# 5 5 AB <NA> <NA> 8.455011
I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, and each row is a unique 15-item vector.
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2). Vec1 is the vector for Row1 and Vec2 is the vector for the next row (lead). I need this function to loop over rows for very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted to the next row stopping at the second to last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but am getting an 'argument is not a matrix' error
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works successfully for row1 vs. row2 not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i){
rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values
vals <- unlist(
lapply(
2:nrow(mat), function(i){
cosine(mat[i-1,], mat[i,])
}
)
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
Runs quickly -- 3.39 seconds on my computer
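For very large inputs, the per-row lapply() call can probably be dropped altogether, since the cosine of two vectors is their dot product divided by the product of their norms. The sketch below computes that for every adjacent pair of rows at once in base R; it does not call lsa::cosine, and the small random matrix is only stand-in data:

```r
# Stand-in data: 10 rows, each a 15-item vector.
set.seed(1)
mat <- matrix(rnorm(10 * 15), nrow = 10)
n <- nrow(mat)

# Rows 1..(n-1) and their successors, rows 2..n.
a <- mat[-n, , drop = FALSE]
b <- mat[-1, , drop = FALSE]

# Row-wise dot products divided by the product of the row norms.
cos_next <- rowSums(a * b) / sqrt(rowSums(a^2) * rowSums(b^2))

# The last row has no successor, so pad with NA to match nrow(mat).
cos_next <- c(cos_next, NA)
cos_next
```

The result has one value per row, so it can be attached directly as a new column (for example dat$cosineraw <- cos_next) once the data frame has been converted with as.matrix().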
My answer is similar to Kat's, but I first packed the 15 values of each row into a list and then created a new column holding the lead (next-row) version of that list column.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
rowwise %>%
summarise(row_v = list(c_across())) %>%
mutate(nextrow_v = lead(row_v)) %>%
replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
rowwise %>%
summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
I'm assuming that you aren't looking for vectorization, as well (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide. The first 15 are my vector; the second 15 are the lead.
df2 %>%
rowwise %>%
mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for(i in 1:(nrow(df1) - 1)) {
v1 <- df1[i, ] %>% unlist()
v2 <- df1[i + 1, ] %>% unlist()
message(cosine(v1, v2))
}
invisible(
lapply(1:(nrow(df1) - 1),
function(i) {message(cosine(unlist(df1[i, ]),
unlist(df1[i + 1, ])))}))
I want to transpose the first two rows into two new columns and keep the rest of the data frame as it is. How do I do it in R?
My original data
A <- c("2012","PL",3,2)
B <- c("2012","PL",6,1)
C <- c("2012","PL",7,4)
DF <- data.frame(A,B,C)
My final data after transpose
V1 <- c("2012","2012")
V2 <- c("PL","PL")
A <- c(3,2)
B <- c(6,1)
C <- c(7,4)
DF <- data.frame(V1,V2,A,B,C)
Where V1 and V2 are the names for new columns and they are created automatically.
Thank you for any assistance.
Base R:
cbind(t(DF[1:2, 1, drop=FALSE]), DF[-(1:2),])
# Warning in data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
# 1 2 A B C
# 1 2012 PL 3 6 7
# 2 2012 PL 2 1 4
though I have some concerns about the apparent key property of "2012" and "PL". That is, you start with three instances of each and end with two. Logically it makes sense, though really to me it looks as if you have a matrix of numbers associated with a single "2012","PL", but perhaps that's not how the data is coming to you. (If you can change the format of the data before getting to this point such that you have a matrix and its associated keys, then it might make data munging more direct, declarative, and resistant to bugs.)
Here is an option with slice
library(dplyr)
DF %>%
select(A) %>%
slice(1:2) %>%
t %>%
as.data.frame %>%
bind_cols(DF %>%
slice(-(1:2)))
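Note that in the original DF every column is character, because each one mixes text and numbers, so A, B and C stay character after either approach. A possible follow-up in base R, assuming (as both answers do) that the keys are read from the first column and that V1 and V2 are the wanted names:

```r
# Original data from the question; every column is character.
A <- c("2012", "PL", 3, 2)
B <- c("2012", "PL", 6, 1)
C <- c("2012", "PL", 7, 4)
DF <- data.frame(A, B, C, stringsAsFactors = FALSE)

# Everything below the two key rows, converted to numeric types where possible.
rest <- DF[-(1:2), ]
rest[] <- lapply(rest, type.convert, as.is = TRUE)

# The two key values from the first column are recycled over the remaining rows.
out <- data.frame(V1 = DF[1, 1], V2 = DF[2, 1], rest,
                  row.names = NULL, stringsAsFactors = FALSE)
str(out)
```

type.convert() with as.is = TRUE turns the digit strings into integers; wrap the columns in as.numeric() instead if doubles are needed.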
I'm attempting to replace empty values in column z based on the values in column x.
I've used filter() to narrow down to the rows of interest and applied mutate() afterwards, but the mutated values are not written back to the original dataframe. I can store the result as a new dataframe, but merging it back afterwards would be a considerable headache, as this happens across dozens of conditionals.
# make dummy data
xx <- data.frame(x = c(1,2,3), y = c("a","","c"), z=c(5,5,""))
xx %>% filter(x == 3) %>% # filter to value of interest
filter(z == "") %>% # filter to NA values to be replaced
mutate(z = replace(z, z =="", 5) ) # mutate to replace NA value
If I do:
xx <- xx %>% filter(x == 3) %>% # filter to value of interest
filter(z == "") %>% # filter to NA values to be replaced
mutate(z = replace(z, z =="", 5) ) # mutate to replace NA value
then only the single row is stored...
I'm looking for a way to keep all of the other dataframe data but replace the mutated data.
It feels like it should be a quick fix, but I've been stuck on it for a while.
You can use an ifelse() statement within dplyr::mutate().
df <- data.frame(x=sample(1:10,100,T),
y=sample(c(NA,1:5),100,T))
df %>% mutate(y=ifelse(is.na(y),x,y))
x y
1 7 7
2 10 3
3 7 1
4 7 1
5 10 4
6 3 3
...
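Applied to the xx data from the question, the same idea might look like the sketch below. It assumes that z is a character column (which is why the replacement is "5" rather than 5) and that only rows with x == 3 should be filled:

```r
library(dplyr)

# Dummy data from the question; z is character because it mixes 5 and "".
xx <- data.frame(x = c(1, 2, 3), y = c("a", "", "c"), z = c(5, 5, ""),
                 stringsAsFactors = FALSE)

# No filter(): the condition lives inside ifelse(), so every row is kept
# and only the rows matching both conditions get a new z.
xx <- xx %>%
  mutate(z = ifelse(x == 3 & z == "", "5", z))

xx
```

Since nothing is filtered out, there is no merge step afterwards, and further conditionals can be added as extra mutate() calls or folded into a single case_when().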