Using spread to create two value columns with tidyr - r

I have a data frame that looks just like this (see link). I'd like to take the output that is produced below and go one step further by spreading the tone variable across both the n and the average variables. It seems like this topic might bear on this, but I can't get it to work:
Is it possible to use spread on multiple columns in tidyr similar to dcast?
I'd like the final table to have the source variable in one column, then the tone-n and tone-avg variables in columns. So I'd like the column headers to be "source" - "For - n" - "Against - n" - "For - Avg" - "Against - Avg". This is for publication, not for further calculation, so it's about presenting data. It seems more intuitive to me to present data in this way. Thank you.
#variable1
Politician.For<-sample(seq(0,4,1),50, replace=TRUE)
#variable2
Politician.Against<-sample(seq(0,4,1),50, replace=TRUE)
#Variable3
Activist.For<-sample(seq(0,4,1),50,replace=TRUE)
#variable4
Activist.Against<-sample(seq(0,4,1),50,replace=TRUE)
#dataframe
df<-data.frame(Politician.For, Politician.Against, Activist.For,Activist.Against)
#tidyr
df %>%
#Gather all columns
gather(df) %>%
#separate by the period character
#(the default separator is any non-alphanumeric character)
separate(col=df, into=c('source', 'tone')) %>%
#group by both source and tone
group_by(source,tone) %>%
#summarise to create counts and average
summarise(n=sum(value), avg=mean(value)) %>%
#try to spread
spread(tone, c('n', 'value'))

I think what you want is another gather to break out the count and mean as separate observations, the gather(type, val, -source, -tone) below.
library(dplyr)
library(tidyr)
gather(df, who, value) %>%
separate(who, into=c('source', 'tone')) %>%
group_by(source, tone) %>%
summarise(n=sum(value), avg=mean(value)) %>%
gather(type, val, -source, -tone) %>%
unite(stat, c(tone, type)) %>%
spread(stat, val)
Yields
Source: local data frame [2 x 5]
source Against_avg Against_n For_avg For_n
1 Activist 1.82 91 1.84 92
2 Politician 1.94 97 1.70 85
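In newer versions of tidyr (1.0.0 and later), pivot_wider() accepts more than one value column, so the second gather and the unite are not strictly needed. A minimal sketch, assuming the same df layout as in the question; the seed and the object name wide are only for illustration:

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  Politician.For     = sample(0:4, 50, replace = TRUE),
  Politician.Against = sample(0:4, 50, replace = TRUE),
  Activist.For       = sample(0:4, 50, replace = TRUE),
  Activist.Against   = sample(0:4, 50, replace = TRUE)
)

wide <- df %>%
  # split "Politician.For" into source = "Politician" and tone = "For"
  pivot_longer(everything(),
               names_to = c("source", "tone"),
               names_sep = "\\.") %>%
  group_by(source, tone) %>%
  # n is the sum of the values, exactly as in the question's code
  summarise(n = sum(value), avg = mean(value), .groups = "drop") %>%
  # spread both n and avg across tone in one step
  pivot_wider(names_from = tone, values_from = c(n, avg))

names(wide)
```

The new columns are named from the value column and the tone (n_For, avg_Against and so on). If the tone should come first in the header, the names_glue argument of pivot_wider() can probably be used for that, for example names_glue = "{tone}_{.value}".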

Using data.table syntax (thanks @akrun):
library(data.table)
dcast(
setDT(melt(df))[,c('source', 'tone'):=
tstrsplit(variable, '[.]')
][,list(
N = sum(value),
avg= mean(value))
,by=.(source, tone)],
source~tone,
value.var=c('N','avg'))


Apply function on data.frame with mutate across using the same columns from another data.frame

I have two data frames with spectral bands from a satellite, redDF and nirDF. Both data frames have values per date column starting with an 'X', these names correspond in both data frames.
I want to get a new data frame where for each column starting with an 'X' in both redDF and nirDF a new value is calculated according to some formula.
Here is a data sample:
library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
as.Date("2015-09-20", "%Y-%m-%d"),
by="16 days") %>%
format(., "%Y-%m-%d") %>%
paste0("X", .)
# sample data values
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices
# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
mydata[sample(1:length(mydata))],
mydata[sample(1:length(mydata))]) %>%
t() %>%
as.data.frame(., row.names = FALSE) %>%
rename_with(~datecolnames) %>%
mutate(id = row_number()+1142) %>%
select(id, everything())
# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>%
mutate(across(-id,~as.integer(.x+20*1.8))) %>%
select(id, everything())
> nirDF
id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143 NA 645 NA 636 569 841 706
2 1144 1025 NA 706 569 354 NA NA
3 1145 904 636 706 645 NA NA 115
X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1 115 1025 904 NA 409 354
2 115 636 409 645 841 904
3 569 409 354 841 1025 NA
and this is the formula:
getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}
I hoped I would be able to do something like:
ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))
But that doesn't work, as dplyr doesn't know what the nir argument of getNDVI should be. I have seen solutions for accessing other data frames in mutate() by using the $COLNAME indexer, but since I have 197 columns, that is not an option here.
I would approach this with a for loop, though I know it does not make best use of functionality like across.
First we create a list of the columns we want to iterate over:
cols_to_iterate_over = redDF %>%
select(starts_with("X")) %>%
colnames()
Then we join on id and ensure columns are named according to source dataset:
joined_df = redDF %>%
inner_join(nirDF, by = "id", suffix = c("_red","_nir"))
So joined_df should have columns like:
id X2015-05-01_red X2015-05-01_nir X2015-05-17_red X2015-05-17_nir ...
Then we can loop over these:
for(col in cols_to_iterate_over){
# columns for calculation
red_col = paste0(col,"_red") %>% sym()
nir_col = paste0(col,"_nir") %>% sym()
out_col = col %>% sym()
# calculate
joined_df = joined_df %>%
mutate(
!!out_col := round((!!nir_col - !!red_col)/(!!nir_col + !!red_col),
digits = 4)
) %>%
select(-!!red_col, -!!nir_col)
}
Explanation: We can use text strings as variable names if we turn them into symbols and then !! them.
sym() turns text into symbols,
!! inside dplyr commands turns symbols into code,
and := is equivalent to = but permits us to have !! on the left-hand side.
Sorry, this is slightly old syntax. For the current approaches see programming with dplyr.
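A minimal sketch of the same loop in the newer style, using the .data pronoun and glue-style names on the left-hand side of := instead of sym() and !!. The data frame joined_demo and its columns are hypothetical stand-ins for joined_df, and this assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical stand-in for joined_df: two bands per date column
joined_demo <- data.frame(
  id     = 1:2,
  X1_red = c(100, 200), X1_nir = c(300, 200),
  X2_red = c(50, 60),   X2_nir = c(150, 60)
)

for (col in c("X1", "X2")) {
  red_col <- paste0(col, "_red")
  nir_col <- paste0(col, "_nir")

  joined_demo <- joined_demo %>%
    # "{col}" is replaced by the value of col; .data[[...]] looks a column up by its name
    mutate("{col}" := round((.data[[nir_col]] - .data[[red_col]]) /
                              (.data[[nir_col]] + .data[[red_col]]), digits = 4)) %>%
    # drop the two input columns once the index is computed
    select(-all_of(c(red_col, nir_col)))
}

joined_demo
```

Strings stay strings throughout, so there is no need to convert them to symbols first.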
In its most basic form, you can just do this:
round((nirDF - redDF)/(nirDF + redDF), digits = 4)
But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:
red <- redDF %>%
arrange(id) %>% # be sure to apply the same order everywhere
select(starts_with('X')) %>%
mutate(across(everything(), as.numeric)) # be sure to have numeric columns
nir <- nirDF %>% arrange(id) %>%
select(starts_with('X')) %>%
mutate(across(everything(), as.numeric))
# make sure that the number of rows are equal
if(nrow(red) == nrow(nir)){
ndvi <- redDF %>%
# get data.frame with ndvi values
transmute(round((nir - red)/(nir + red), digits = 4)) %>%
# bind id-column and possibly other columns to the data frame
bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>%
# place the id-column to the front
select(!starts_with('X'), everything())
}
As far as I have understood dplyr by now, it boils down to this:
across is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.
c_across on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.
Neither of these is suitable for this task. However, by design, arithmetic operations can be applied to data frames in R (just try cars*cars for instance). This is what we need in this case. Luckily, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames.
While doing so, you need to take some requirements into account:
The two data frames must have the same number of rows and columns, with the rows in the same order; otherwise R stops with an error (the operator is only defined for equally-sized data frames) instead of aligning them for you.
All columns in the data frame need to be of a numeric class (numeric or integer).
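If you would still like to stay within across(), one possibility is to look up the matching column of the second data frame by name with cur_column(). A minimal sketch, assuming both data frames hold the same ids in the same row order; red_demo and nir_demo are small hypothetical stand-ins for redDF and nirDF:

```r
library(dplyr)

getNDVI <- function(red, nir) round((nir - red) / (nir + red), digits = 4)

# Hypothetical stand-ins for redDF and nirDF
red_demo <- data.frame(id = 1:3, X1 = c(100, 200, NA), X2 = c(50, 60, 70))
nir_demo <- data.frame(id = 1:3, X1 = c(300, 200, 400), X2 = c(150, 60, 210))

# The lookup below is purely positional, so the rows must already line up
stopifnot(identical(red_demo$id, nir_demo$id))

ndvi_demo <- red_demo %>%
  # cur_column() gives the name of the column across() is currently working on,
  # which is used to pull the column of the same name out of nir_demo
  mutate(across(starts_with("X"), ~ getNDVI(.x, nir_demo[[cur_column()]])))

ndvi_demo
```

This keeps the id column in place and leaves NA wherever either input is NA.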

R: What is the expected output of passing a character vector to dplyr::all_of()?

I am trying to understand the expected output of dplyr::group_by() in conjunction with the use of dplyr::all_of(). My understanding is that using dplyr::all_of() should convert character vectors containing variable names to bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is expected when using all_of and how might I alternatively pass a character vector to group_by to process as a vector of bare names?
library(dplyr)
# Create a 40-row data.frame (var is recycled to the length of bar) with
# 2 variables each with 2 unique values.
df <- data.frame(var = rep(c("a", "b"), 10),
bar = rep(c(1, 2), 20))
# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())
# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())
# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())
# Output 4: Error in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(in_var) %>%
summarize(n = n())
})
# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(all_of(in_var)) %>%
summarize(n = n())
})
We can use group_by_at
lapply(foo2, function(in_var) df %>%
group_by_at(all_of(in_var)) %>%
summarise(n = n()))
-output
#[[1]]
# A tibble: 2 x 2
# var n
#* <chr> <int>
#1 a 20
#2 b 20
#[[2]]
# A tibble: 2 x 2
# bar n
#* <dbl> <int>
#1 1 20
#2 2 20
As across replaces some of the functionality of group_by_at, we can use it instead with all_of:
lapply(foo2, function(in_var) df %>%
group_by(across(all_of(in_var))) %>%
summarise(n = n()))
Or convert to symbol and evaluate (!!)
lapply(foo2, function(in_var) df %>%
group_by(!! rlang::sym(in_var)) %>%
summarise(n = n()))
Or use map
library(purrr)
map(foo2, ~ df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(n = n()))
Or instead of group_by, it can be count
map(foo2, ~ df %>%
count(across(all_of(.x))))
To add to @akrun's answer showing multiple ways to achieve the desired output: my understanding is that all_of() is a selection helper for variables stored in a character vector, meant for dplyr's selecting functions, and that it uses vctrs underneath. any_of() is a less strict version of all_of() that ignores names that do not exist, which is convenient in some use cases.
Reading ?tidyselect::all_of is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are superseded in favour of across(), based on decisions by the devs at RStudio. See ?group_by_at or the other *_if, *_at, *_all documentation. So I guess it really depends on what version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context of changes in solutions over time with passing characters into dplyr functions, and there's probably more posts out there.
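To make the difference concrete, here is a minimal sketch of all_of() and any_of() inside a selecting function, and of all_of() wrapped in across() for grouping. The vector wanted and the column name missing_col are made up for illustration:

```r
library(dplyr)

df <- data.frame(var = rep(c("a", "b"), 20),
                 bar = rep(c(1, 2), 20))

wanted <- c("var", "missing_col")  # "missing_col" is deliberately absent from df

# any_of() silently skips names that are not in the data
picked <- select(df, any_of(wanted))

# all_of() is strict: a missing name is an error
strict <- tryCatch({
  select(df, all_of(wanted))
  "ok"
}, error = function(e) "error")

# group_by() uses data masking, not tidy selection,
# so the selection helper goes inside across()
by_var <- df %>%
  group_by(across(all_of("var"))) %>%
  summarise(n = n(), .groups = "drop")

names(picked)
strict
by_var
```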

Determine the size of string in a particular cell in dataframe: R

In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates in the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply and get the lengths
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Column1) %>%
group_by(rn) %>%
summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
select(Size) %>%
bind_cols(df1, .)
-output
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df1 %>%
mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't read the OP's requirements properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It basically counts how many words there are.
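A small base R sketch of the difference between counting all words and counting distinct words; the vector answers is made-up data that includes a duplicate:

```r
# Hypothetical answers; the second entry repeats the same word
answers <- c("word1,word2,word3", "word1, word1", "word1")

# split on commas, allowing optional whitespace after each comma
tokens <- strsplit(answers, ",\\s*")

n_tokens <- lengths(tokens)                  # counts every word
n_unique <- lengths(lapply(tokens, unique))  # counts distinct words only

data.frame(answers, n_tokens, n_unique)
```

The two counts only differ for the second entry, which is where a comma-counting approach would overstate the size.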

Sum columns based on index in a different data frame in R

I have two data frames similar to this:
df<-data.frame("A1"=c(1,2,3), "A2"=c(3,4,5), "A3"=c(6,7,8), "B1"=c(3,4,5))
ref_df<-data.frame("Name"=c("A1","A2","A3","B1"),code=c("Blue" ,"Blue","Green","Green"))
I would like to sum the values in the columns of df based on the code in the ref_df. I would like to store the results in a new data frame with column names matching the code in the ref_df
i.e. I would like a new data frame with Blue and Green as columns and the values representing the sum of A1+A2 and A3+B1 respectively. Like the one here:
result<-data.frame("Blue"=c(4,6,8), "Green"=c(9,11,13))
There are lots of posts on summing columns based on conditions, but after a morning of research I cannot find anything that solves my exact problem.
We can split the columns in df based on values in ref_df$code and then take row-wise sum.
sapply(split.default(df, ref_df$code), rowSums)
# Blue Green
#[1,] 4 9
#[2,] 6 11
#[3,] 8 13
If the rows of ref_df do not follow the same order as the column names in df, reorder them first.
ref_df <- ref_df[match(names(df), ref_df$Name),]
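A minimal sketch of that alignment step with a deliberately shuffled lookup table. The name ref_shuffled is hypothetical; the point is that the codes are looked up in the order of names(df) before splitting:

```r
df <- data.frame(A1 = c(1, 2, 3), A2 = c(3, 4, 5),
                 A3 = c(6, 7, 8), B1 = c(3, 4, 5))

# Hypothetical lookup table whose rows are NOT in the order of names(df)
ref_shuffled <- data.frame(Name = c("B1", "A3", "A1", "A2"),
                           code = c("Green", "Green", "Blue", "Blue"))

# for every column of df, find its row in the lookup table and take the code
codes <- ref_shuffled$code[match(names(df), ref_shuffled$Name)]

# split the columns by code and sum each group row-wise
result <- as.data.frame(sapply(split.default(df, codes), rowSums))
result
```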
We can use tidyverse
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = 'Name') %>%
left_join(ref_df, by = 'Name') %>%
group_by(code, rn) %>%
summarise(Sum = sum(value), .groups = 'drop') %>%
pivot_wider(names_from = code, values_from = Sum) %>%
select(-rn)

Iterating (Looping) through columns of a dataframe in R

I am struggling in R and hope that someone can help me out. I am trying to write a for loop to iterate over the columns of a data frame, but unfortunately, I am not successful.
So here is my Problem:
I have 10 data frames (dt1, dt2 ,dt3,…,dt10). For example, dt1 looks like this:
dt1<-data.frame(Topic1=c(1,2,3,4,5,6,7,8,9),Topic2=c(9,8,7,6,5,4,3,2,1), Topic3=c(1,9,2,8,3,7,4,6,5), Name=c("A","A","A","A","A","B","B","B","B"))
I want to check if the Name variable still contains “A” and “B” when I filter Topic1 (then Topic2, Topic3…) to values greater than 5. At the moment, I do the following:
library(dplyr)
dt.new<-dt1 %>% filter(Topic1>5)
isTRUE("A" %in% dt.new$Name && "B" %in% dt.new$Name)
At the end of the day, for each data frame, I want to have a new table (data frame) that looks like this:
result<-data.frame(Topic=c("Topic1","Topic2","Topic3"),Return=c("FALSE","FALSE","TRUE"))
Now the problem is, that I have several data frames (dt1, dt2…) each of them has more than 50 variables (Topic1,…, Topic50).
I've written some loops so far and tried it out. But unfortunately without success. Therefore I would be happy to receive any hint or tip.
Thank you very much!
One option is to group by 'Name' and, for each column whose name starts with 'Topic', check whether any value is greater than 5. Then gather (superseded in newer tidyr; use pivot_longer instead) to convert from 'wide' to 'long', group by the 'Topic' column, and summarise by checking whether all the 'val' elements are TRUE.
library(dplyr)
library(tidyr)
dt1 %>%
group_by(Name) %>%
summarise_at(vars(starts_with('Topic')), ~ any(. > 5)) %>%
gather(Topic, val, -Name) %>%
group_by(Topic) %>%
summarise(Return = all(val))
# A tibble: 3 x 2
# Topic Return
# <chr> <lgl>
#1 Topic1 FALSE
#2 Topic2 FALSE
#3 Topic3 TRUE
Or reshape it to 'long' format first and then do the summarisation
dt1 %>%
pivot_longer(cols = -Name, names_to = "Topic") %>%
filter(value > 5) %>%
group_by(Topic) %>%
summarise(result = n_distinct(Name) == 2)
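Since the question mentions ten data frames, the same steps can be wrapped in a function and applied to a list of them. A minimal sketch: check_topics and the threshold argument are hypothetical names, and dt2 is a made-up second data frame used only to show the pattern:

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: for each Topic column, TRUE if every Name
# still has at least one value above the threshold
check_topics <- function(dt, threshold = 5) {
  dt %>%
    group_by(Name) %>%
    summarise(across(starts_with("Topic"), ~ any(.x > threshold)),
              .groups = "drop") %>%
    pivot_longer(-Name, names_to = "Topic", values_to = "val") %>%
    group_by(Topic) %>%
    summarise(Return = all(val), .groups = "drop")
}

dt1 <- data.frame(Topic1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                  Topic2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1),
                  Topic3 = c(1, 9, 2, 8, 3, 7, 4, 6, 5),
                  Name   = c("A", "A", "A", "A", "A", "B", "B", "B", "B"))

# Made-up second data frame, only to illustrate the list approach
dt2 <- transform(dt1, Topic1 = rev(Topic1))

# one result table per data frame, kept in a named list
results <- lapply(list(dt1 = dt1, dt2 = dt2), check_topics)
results$dt1
```

Keeping the data frames in a named list avoids having to refer to dt1, dt2, … by name one at a time.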
