This is my first post here :)
So I encountered some weird behavior today: When using the dplyr mutate function together with the paste function, the outcome is the same for every row.
Here is an example:
vec1 <- c(2, 5)
vec2 <- c(4, 6)
test_df <- data.frame(vec1, vec2)
test_df %>% mutate(new_col = paste(vec1:vec2, collapse = ","))
with the output
vec1 vec2 new_col
1 2 4 2,3,4
2 5 6 2,3,4
But that's not what I wanted or expected.
Here is what I wanted, achieved with a loop:
df <- test_df %>% mutate(new_col = 1)
for(i in 1:nrow(test_df)){
df$new_col[i] <- paste(df$vec1[i]:df$vec2[i], collapse = ",")
}
With the output:
vec1 vec2 new_col
1 2 4 2,3,4
2 5 6 5,6
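What's going on, and how can I achieve the same result with mutate and paste?
The : operator is not vectorized: vec1:vec2 uses only the first element of each column (R warns about this), so inside mutate it evaluates to 2:4 once, and the single pasted string is recycled to every row. Instead, we can loop over the paired elements of vec1 and vec2 with map2 and paste (str_c) each sequence into a single string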
library(dplyr)
library(purrr)
library(stringr)
test_df %>%
mutate(new_col = map2_chr(vec1, vec2, ~ str_c(.x:.y, collapse = ",")))
-output
vec1 vec2 new_col
1 2 4 2,3,4
2 5 6 5,6
Or with rowwise
test_df %>%
rowwise %>%
mutate(new_col = str_c(vec1:vec2, collapse = ",")) %>%
ungroup
# A tibble: 2 × 3
vec1 vec2 new_col
<dbl> <dbl> <chr>
1 2 4 2,3,4
2 5 6 5,6
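For comparison, here is a minimal base R sketch of the same pairwise idea, with no extra packages. It assumes the test_df from the question; mapply plays the role of map2_chr, and the name first_only is only illustrative.

```r
vec1 <- c(2, 5)
vec2 <- c(4, 6)
test_df <- data.frame(vec1, vec2)

# What the original mutate() call effectively computed: `:` only looks at
# the first element of each side, i.e. vec1[1]:vec2[1]
first_only <- paste(vec1[1]:vec2[1], collapse = ",")

# mapply() walks over both columns in parallel, one (from, to) pair per row
test_df$new_col <- mapply(
  function(from, to) paste(from:to, collapse = ","),
  test_df$vec1, test_df$vec2
)

print(test_df)
```

Here first_only is the single string that was recycled to every row in the original attempt, while new_col holds one string per row.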
Related
I have the following data,
col <- c('Data1,Data2','a,b,c','d')
df <- data.frame(col)
I want to split the data where the elements are more than 2 in a cell. So "a,b,c" should be split into "a,b" , "b,c" and "c,a". See attached for reference.
We create a row identifier (row_number()), split 'col' on the delimiter (separate_rows), and group by 'rn'. Then, in summarise, for groups with more than one row we take the pairwise combinations (combn) of 'col' and paste each pair together; single-row groups are returned as they are
library(stringr)
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(col) %>%
group_by(rn) %>%
summarise(col = if(n() > 1) combn(col, 2, FUN = str_c, collapse=",") else col,
.groups = 'drop') %>%
select(-rn)
-output
# A tibble: 5 x 1
# col
# <chr>
#1 Data1,Data2
#2 a,b
#3 a,c
#4 b,c
#5 d
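The key step is combn with a FUN argument, so here is a small stand-alone sketch of just that call on one cell. The names parts and pairs are illustrative only, and base paste is used in place of str_c.

```r
# Split one cell into its elements
parts <- strsplit("a,b,c", ",", fixed = TRUE)[[1]]

# combn() generates every 2-element combination and applies FUN to each;
# extra arguments (here collapse) are passed on to FUN
pairs <- combn(parts, 2, FUN = paste, collapse = ",")

print(pairs)
```

Note that combn keeps the elements in their original order, so the third pair comes out as "a,c" rather than the "c,a" written in the question.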
Here is a base R option using combn
data.frame(col = unlist(sapply(
strsplit(df$col, ","),
function(x) {
if (length(x) == 1) {
x
} else {
combn(x, 2, paste0, collapse = ",")
}
}
)))
which gives
col
1 Data1,Data2
2 a,b
3 a,c
4 b,c
5 d
library(tidyverse)
df %>%
rowwise()%>%
mutate(col = list(if(str_count(col, ",")>1) combn(strsplit(col, ",")[[1]], 2, toString) else col))%>%
unnest(col)
# A tibble: 5 x 1
col
<chr>
1 Data1,Data2
2 a, b
3 a, c
4 b, c
5 d
I have a dataframe with two columns per sample (n > 1000 samples):
df <- data.frame(
"sample1.a" = 1:5, "sample1.b" = 2,
"sample2.a" = 2:6, "sample2.b" = c(1, 3, 3, 3, 3),
"sample3.a" = 3:7, "sample3.b" = 2)
If there is a zero in column .b, the corresponding value in column .a should be set to NA.
I thought of writing a function over the column names (without suffix) that filters each pair of columns and conditionally replaces values. Is there a simpler approach based on tidyverse?
We can split the data.frame into a list of data.frames and do the replacement in base R
df1 <- do.call(cbind, lapply(split.default(df,
sub("\\..*", "", names(df))), function(x) {
x[,1][x[2] == 0] <- NA
x}))
Or another option is Map
acols <- endsWith(names(df), "a")
bcols <- endsWith(names(df), "b")
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])
Or, if the columns strictly alternate between 'a' and 'b', use a recycled logical index: build a logical matrix from the 'b' columns and assign NA to the corresponding positions in the 'a' columns
df[c(TRUE, FALSE)][df[c(FALSE, TRUE)] == 0] <- NA
or an option with tidyverse by reshaping into 'long' format (pivot_longer), changing the 'a' column to NA if there is a corresponding 0 in 'b', and reshaping back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_sep="\\.",
names_to = c('group', '.value')) %>%
mutate(a = replace(a, b == 0, NA)) %>%
pivot_wider(names_from = group, values_from = c(a, b)) %>%
select(-rn)
# (the example data has no zeros in the 'b' columns, so no value is replaced)
# A tibble: 5 x 6
# a_sample1 a_sample2 a_sample3 b_sample1 b_sample2 b_sample3
# <int> <int> <int> <dbl> <dbl> <dbl>
#1 1 2 3 2 1 2
#2 2 3 4 2 3 2
#3 3 4 5 2 3 2
#4 4 5 6 2 3 2
#5 5 6 7 2 3 2
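Since the example data in the question contains no zeros, the replacement is easier to see on a small made-up data frame. The sketch below applies the Map option from this answer to hypothetical data (toy and its values are not from the question).

```r
# Hypothetical data with zeros in the .b columns
toy <- data.frame(
  sample1.a = 1:3, sample1.b = c(0, 2, 2),
  sample2.a = 4:6, sample2.b = c(1, 0, 0)
)

# Locate the paired columns by suffix
acols <- endsWith(names(toy), ".a")
bcols <- endsWith(names(toy), ".b")

# For each (a, b) pair, blank out a wherever b is zero
toy[acols] <- Map(function(a, b) replace(a, b == 0, NA),
                  toy[acols], toy[bcols])

print(toy)
```

This relies on the .a and .b columns appearing in the same sample order, which holds for the layout shown in the question.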
I have a numeric vector with names following a pattern. The name for each element consists of two parts. There are a fixed number of variations on the first part and a fixed number of variations on the second part per the below.
x <- c(2, 4, 3, 7, 6, 9)
names(x) <- c("a.0", "b.0", "c.0", "a.1", "b.1", "c.1")
From this I want to create and print a table where the first part of the names gives the columns and the second part gives the rows, as shown below.
a b c
0 2 4 3
1 7 6 9
Here are some possibilities. The first 3 only use base R.
1) tapply Use tapply with the row and column parts specified in the second argument.
nms <- names(x)
tapply(x, list(row = sub(".*\\.", "", nms), col = sub("\\..*", "", nms)), c)
giving the following matrix with the indicated row and column names.
col
row a b c
0 2 4 3
1 7 6 9
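If the two sub() patterns are hard to read, the name parts can also be pulled out with strsplit before calling tapply. This is only a sketch of the same approach; parts, col_part, row_part and tab are illustrative names.

```r
x <- c(2, 4, 3, 7, 6, 9)
names(x) <- c("a.0", "b.0", "c.0", "a.1", "b.1", "c.1")

# Split each name at the literal dot into its two parts
parts <- strsplit(names(x), ".", fixed = TRUE)
col_part <- vapply(parts, `[`, character(1), 1)  # "a", "b", "c", ...
row_part <- vapply(parts, `[`, character(1), 2)  # "0", "0", "0", "1", ...

# Each (row, col) cell holds exactly one value, so sum just returns it
tab <- tapply(x, list(row = row_part, col = col_part), sum)

print(tab)
```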
2) xtabs Another possibility is to use xtabs:
dnms <- read.table(text = names(x), sep = ".", as.is = TRUE,
col.names = c("col", "row"))[2:1]
xtabs(x ~ ., dnms)
giving this xtabs/table object:
col
row a b c
0 2 4 3
1 7 6 9
3) reshape
long <- cbind(x, read.table(text = names(x), sep = ".", as.is = TRUE,
col.names = c("col", "row")))
r <- reshape(long, dir = "wide", idvar = "row", timevar = "col")[-1]
dimnames(r) <- lapply(long[3:2], unique)
r
giving this data.frame:
a b c
0 2 4 3
1 7 6 9
4) dplyr/tidyr/tibble Using the indicated packages we can form the following pipeline:
library(dplyr)
library(tidyr)
library(tibble)
x %>%
stack %>%
separate(ind, c("col", "rowname")) %>%
pivot_wider(names_from = col, values_from = values) %>%
column_to_rownames
giving this data.frame:
a b c
0 2 4 3
1 7 6 9
If you are using an older version of tidyr replace the pivot_wider line with
spread(col, values) %>%
As per @d.b's comment, this would also work:
x %>%
data.frame %>%
rownames_to_column %>%
separate(rowname, c("col", "rowname")) %>%
pivot_wider(names_from = col, values_from = ".") %>%
column_to_rownames
do.call(rbind, split(x, gsub(".*\\.(.*)", "\\1", names(x))))
# a.0 b.0 c.0
#0 2 4 3
#1 7 6 9
Here is a problem that I can't solve:
Data:
df <- data.frame(f1=c("a", "a", "b", "b", "c", "c", "c"),
v1=c(10, 11, 4, 5, 0, 1, 2))
In the data.frame, f1 is a factor:
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
# What I want (for example): keep only the levels that have exactly 2 elements, then turn them into a data.frame with one column per level
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here, but the approach below using dplyr and tidyr works.
library(dplyr)
library(tidyr)
nlevels = 2
df1 <- df %>%
add_count(f1) %>%
filter(n == nlevels) %>%
select(-n) %>%
mutate(rn = row_number()) %>%
spread(f1, v1) %>%
select(-rn)
This gives
# a b
# <int> <int>
#1 10 NA
#2 11 NA
#3 NA 4
#4 NA 5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
As we have filtered the dataframe which has only nlevels observations, we would have same number of rows for each column in the final dataframe.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal length chunks, it can then simply be combined back to a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
Here is a solution using unstack :
unstack(
droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2)==1,]),
v1 ~ f1)
# a b
# 1 10 4
# 2 11 5
A variant, similar to @thelatemail's solution:
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
group_by(f1) %>%
filter(n() == 2) %>%
mutate(i = row_number()) %>%
spread(f1, v1) %>%
select(-i)
# # A tibble: 2 x 2
# a b
# * <dbl> <dbl>
# 1 10 4
# 2 11 5
or mixing approaches :
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
Using all base functions (but you should use tidyverse)
# Work on a copy of the data frame from the question
x <- df
# Add count of instances
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len==2, c('f1','v1')]
# Hacky pivot
result <- data.frame(
lapply(unique(x$f1), FUN = function(y) x$v1[x$f1==y])
)
colnames(result) <- unique(x$f1)
> result
a b
1 10 4
2 11 5
I'd code it like this; maybe it helps you.
library(reshape2)
library(dplyr)
aa = data.frame(v1=c('a','a','b','b','c','c','c'),f1=c(10,11,4,5,0,1,2))
cc = aa %>% group_by(v1) %>% summarise(id = length((v1)))
dd= merge(aa,cc) #get the level
ee = dd[dd$id==2,] #select the levels that occur exactly 2 times
ee$id = rep(c(1,2),nrow(ee)/2) # reset index like (1,2,1,2)
dcast(ee, id~v1,value.var = 'f1')
all done!
I have a list of dataframes which I am trying to apply a script to which works for a single data frame.
Part of the script uses both piping and group_by:
df2 <- df1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
I've tried various loops and variations with lapply but haven't been able to find a way to make it work with a list of data frames, where it would be something along the lines of:
mylist2 <- mylist1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
But obviously changed around to work with loops or lapply. I'm probably missing something simple here but would appreciate some help. Thanks
PS - I looked at providing the data from the lists but wasn't able to provide reproducible samples.
Here is a tidyverse way.
# generate some data
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
col2 = 1:4),
simplify = FALSE)
library(purrr)
library(dplyr)
mylist1 %>%
map(., ~ group_by(., col1) %>%
summarise(new_col = sum(col2)))
#[[1]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
#[[2]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
In base R you might try lapply and tapply
lapply(mylist1, function(x)
tapply(X = x[["col2"]], INDEX = x[["col1"]], FUN = 'sum'))
#[[1]]
#a b
#4 6
#[[2]]
#a b
#4 6
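If each list element should stay a data frame rather than become a named vector, one base R possibility is aggregate inside lapply. This is a sketch on the generated mylist1 from the answer above; mylist2 and newcol follow the names used in the question.

```r
# Same toy data as in the tidyverse answer
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
                                   col2 = 1:4),
                     simplify = FALSE)

# Apply the single-data-frame summary to every element of the list
mylist2 <- lapply(mylist1, function(d) {
  out <- aggregate(col2 ~ col1, data = d, FUN = sum)
  setNames(out, c("col1", "newcol"))  # rename the summed column
})

print(mylist2)
```

The anonymous function is just the per-data-frame script wrapped so that lapply can call it once per element.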