Concatenating two text columns in dplyr

My data looks like this:
round <- c(rep("A", 3), rep("B", 3))
experiment <- rep(c("V1", "V2", "V3"), 2)
results <- rnorm(mean = 10, n = 6)
df <- data.frame(round, experiment, results)
> df
round experiment results
1 A V1 9.782025
2 A V2 8.973996
3 A V3 9.271109
4 B V1 9.374961
5 B V2 8.313307
6 B V3 10.837787
I have a different dataset that will be merged with this one, where each combination of round and experiment is a unique row value, i.e., "A_V1". So what I really want is a variable that concatenates the two columns. However, this is tougher to do in dplyr than I expected. I tried:
name_mix <- paste0(df$round, "_", df$experiment)
new_df <- df %>%
mutate(name = name_mix) %>%
select(name, results)
But I got the error, Column name must be length 1 (the group size), not 6. I also tried the simple base-R approach of cbind(df, name_mix) but received a similar error telling me that df and name_mix were of different sizes. What am I doing wrong?

You can use the unite function from tidyr:
library(tidyverse)
df %>%
unite(round_experiment, c("round", "experiment"))
round_experiment results
1 A_V1 8.797624
2 A_V2 9.721078
3 A_V3 10.519000
4 B_V1 9.714066
5 B_V2 9.952211
6 B_V3 9.642900

This should do the trick if you are looking for a new variable
library(tidyverse)
round <- c(rep("A", 3), rep("B", 3))
experiment <- rep(c("V1", "V2", "V3"), 2)
results <- rnorm(mean = 10, n = 6)
df <- data.frame(round, experiment, results)
df
df <- df %>% mutate(
name = paste(round, experiment, sep = "_")
)
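Since the stated goal is to merge on the combined key, here is a minimal base-R sketch of that follow-up step. The second table `other` and its `label` column are hypothetical stand-ins for the dataset mentioned in the question, and the `results` values are fixed only to keep the example reproducible.

```r
# Example data in the shape of the question (results fixed for reproducibility)
df <- data.frame(
  round = rep(c("A", "B"), each = 3),
  experiment = rep(c("V1", "V2", "V3"), 2),
  results = c(9.8, 9.0, 9.3, 9.4, 8.3, 10.8),
  stringsAsFactors = FALSE
)

# Hypothetical second dataset keyed by the combined "round_experiment" value
other <- data.frame(
  name = c("A_V1", "B_V3"),
  label = c("first", "last"),
  stringsAsFactors = FALSE
)

# Build the key column, then left-join: all.x = TRUE keeps every row of df
df$name <- paste(df$round, df$experiment, sep = "_")
merged <- merge(df, other, by = "name", all.x = TRUE)
merged
```

Rows of df without a match in `other` get NA in `label`.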

You could also try this:
library(tidyr)
library(dplyr)
df = df %>%
unite(combined, round, experiment, sep = "_", remove = FALSE)
The output will be:
combined round experiment results
A_V1 A V1 10.152329
A_V2 A V2 10.863128
A_V3 A V3 10.975773
B_V1 B V1 9.964696
B_V2 B V2 9.876675
B_V3 B V3 9.252936
This will retain your original columns.

Another solution could be to use the stri_join function from the stringi package.
library(stringi)
df$new = stri_join(df$round,df$experiment,sep="_")

Related

How to separate integers from string in a data frame cell that are separated by commas?

I currently have a file that has a variety of responses to some questions. Each cell will have anywhere from 1 to 4 numbers, followed by the word "Finished", all inside one cell. For example, df[1,1] could equal "-5","2","1","Finished". I need to get rid of the word "Finished" and keep just the integers so that I can add them together to get one number for that cell. How can I do this?
Another option, using the base R apply function:
df <- data.frame(X = c('-5,-2,1,Finished','1,2,7,Finished','-3,-2,4,Finished'))
new_df <- apply(df, c(1, 2), FUN = function(x){
values <- trimws(unlist(strsplit(x, split = ","))) # Convert cell values to a vector
values <- values[which(!tolower(values) == "finished")] # Remove Finished
return(sum(as.numeric(values), na.rm = TRUE)) # Add remaining integer values
})
new_df
X
[1,] -6
[2,] 10
[3,] -1
The above will iterate through every cell in a data frame. For each cell it converts the cell's values to a vector by splitting on commas. Then it removes the 'Finished' value from the vector and finally sums all remaining numeric values. new_df will be a matrix the same size as df.
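The per-cell steps described above can also be sketched on a plain character vector, which makes each step easy to inspect. The helper name `sum_cell` is hypothetical and not part of the answer.

```r
# Hypothetical helper: sum the integers in one comma-separated cell
sum_cell <- function(cell) {
  # Split on commas and trim stray whitespace
  parts <- trimws(strsplit(cell, ",", fixed = TRUE)[[1]])
  # Drop the "Finished" marker, ignoring case
  parts <- parts[tolower(parts) != "finished"]
  # Convert the remaining pieces to numbers and add them up
  sum(as.numeric(parts))
}

cells <- c("-5,-2,1,Finished", "1,2,7,Finished", "-3,-2,4,Finished")
totals <- vapply(cells, sum_cell, numeric(1), USE.NAMES = FALSE)
totals
```

vapply is used instead of sapply here so that the result is guaranteed to be a numeric vector with one value per cell.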
Maybe you can try the code below
df <- within(df,
Y <- sapply(regmatches(X,gregexpr("[+-]?\\d+",X)),
function(v) sum(as.integer(v))))
such that
> df
X Y
1 -5,-2,1,Finished -6
2 1,2,7,Finished 10
3 -3,-2,4,Finished -1
Dummy Data
df <- data.frame(X = c('-5,-2,1,Finished','1,2,7,Finished','-3,-2,4,Finished'))
One option after reading the file with read.csv/read.table is to use separate_rows to expand the rows after removing the 'Finished', while using convert = TRUE and then get the sum
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(rn = row_number(), col2 = str_remove(col2, ",\\s*[Ff]inished")) %>%
separate_rows(col2, sep= ",", convert = TRUE) %>%
group_by(rn) %>%
summarise(col3 = sum(col2, na.rm = TRUE)) %>%
select(-rn) %>%
bind_cols(df1, .)
# A tibble: 3 x 3
# col1 col2 col3
# <int> <chr> <int>
#1 1 -5,-2,1,Finished -6
#2 2 -3,-2,5,Finished 0
#3 3 3,4,2,Finished 9
Or using base R
df1$col3 <- sapply(sub(",[Ff]inished", "", df1$col2), function(str1)
sum(scan(text = str1, what = numeric(), sep=",", quiet = TRUE)))
data
df1 <- read.csv('yourfile.csv', stringsAsFactors = FALSE)
df1 <- data.frame(col1 = 1:3, col2 = c('-5,-2,1,Finished',
'-3,-2,5,Finished', '3,4,2,Finished'), stringsAsFactors = FALSE)

Using set_names vs. mutate(colnames) when changing data frame column names to lower case

A quick question that I was looking to understand better.
Data:
df1 <- data.frame(COLUMN_1 = letters[1:3], COLUMN_2 = 1:3)
> df1
COLUMN_1 COLUMN_2
1 a 1
2 b 2
3 c 3
Why does this work in setting data frame names to lower case:
df2 <- df1 %>%
set_names(., tolower(names(.)))
> df2
column_1 column_2
1 a 1
2 b 2
3 c 3
But this does not?
df2 <- df1 %>%
mutate( colnames(.) <- tolower(colnames(.)) )
Error: Column `colnames(.) <- tolower(colnames(.))` must be length 3 (the number of rows) or one, not 2
The solution, writing the arguments out explicitly, is:
df1 %>% rename_all(tolower) ==
rename_all(.tbl = df1, .funs = tolower)
mutate operates on the data itself, not the column names, so that's why we're using rename. We use rename_all because you don't want to type out 1 = tolower(1), 2 = tolower(2), ...
What you suggested, df2 <- df1 %>% rename_all(tolower(.)) doesn't work because then you would be trying to feed the whole df1 into the tolower function, which is not what you want.
Another solution would be this: names(df) <- tolower(names(df))
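The split between column values and column names can be seen in base R alone. This is a small sketch using stats::setNames, which set_names wraps; it only illustrates the idea and is not the code from the answers.

```r
df1 <- data.frame(COLUMN_1 = letters[1:3], COLUMN_2 = 1:3,
                  stringsAsFactors = FALSE)

# Names are an attribute of the data frame, separate from the column values.
# setNames returns a copy with new names and leaves the values untouched.
df2 <- setNames(df1, tolower(names(df1)))

names(df1)  # original names are unchanged
names(df2)  # lower-cased names
```

A mutate call, by contrast, expects each expression to produce a column of values, which is why assigning to colnames(.) inside it fails.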

How to number unique pairs X,Y

Ok, so I have the following data.frame:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
df<- data.frame(v1,v2)
I want to obtain an extra variable with a numeric count of pairs of v1 and v2 values. The trick is that I need to number them by unique pairs so, for example (456,981 and 981,456) should be numbered 1.
So the outcome would be something like this:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
v3<-c(1,2,1,3,2,3)
df<- data.frame(v1,v2,v3)
You can sort rowwise and use match, i.e.
v1 <- do.call(paste, data.frame(t(apply(df, 1, sort))))
match(v1, unique(v1))
#[1] 1 2 1 3 2 3
How about this using dplyr. Basically you would sort the columns for each row. Not sure if it would be more efficient or not. Obviously it is a lot more lines.
library(dplyr)
df <- data.frame(v1,v2)
# Sort by v1 and v2 elements by row
df.new <- df %>%
mutate(z1 = pmin(v1,v2),
z2 = pmax(v1,v2))
# Build a distinct coding table
df.codes <- df.new %>%
distinct(z1, z2) %>%
mutate(v3 = 1:n())
# Join it back together
df.new %>%
left_join(df.codes, by = c("z1", "z2")) %>%
select(v1, v2, v3)
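The two answers can be combined into a shorter base-R sketch: pmin and pmax put each pair into a fixed order, and match numbers the pairs by first appearance. The variable name `key` is introduced here for illustration.

```r
v1 <- c(456, 234, 981, 776, 112, 998)
v2 <- c(981, 112, 456, 998, 234, 776)

# Order each pair as (smaller, larger) so that (456, 981) and (981, 456)
# produce the same key string
key <- paste(pmin(v1, v2), pmax(v1, v2))

# Number the keys by order of first appearance
v3 <- match(key, unique(key))
df <- data.frame(v1, v2, v3)
df
```

This avoids the row-wise apply and the join, at the cost of building a string key for every row.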

Doing a ranged lookup with multiple variables in a matrix in R

I feel like I have a bit of a complicated problem (or at least for me it is!).
I have a table of prices which will need to be read from a csv which will look exactly like this:
V1 <- c("","Destination","Spain","Spain","Spain","Portugal","Portugal","Portugal","Italy","Italy","Italy")
V2 <- c("","Min_Duration",rep(c(1,3,6),3))
V3 <- c("","Max_Duration",rep(c(2,5,10),3))
V4 <- c("Full-board","Level_1",runif(9,100,200))
V5 <- c("Full-board","Level_2",runif(9,201,500))
V6 <- c("Full-board","Level_3",runif(9,501,1000))
V7 <- c("Half-board","Level_1",runif(9,100,200))
V8 <- c("Half-board","Level_2",runif(9,201,500))
V9 <- c("Half-board","Level_3",runif(9,501,1000))
Lookup_matrix <- as.data.frame(cbind(V1,V2,V3,V4,V5,V6,V7,V8))
The prices in the above table will of course come out a bit strange as they're completely random - but we can ignore that...
I also have a table like this:
Destination <- c("Spain", "Italy", "Portugal")
Duration <- c(2,4,8)
Level <- c(1,3,3)
Board <- c("Half-board","Half-board","Full-board")
Price <- "Empty"
Price_matrix <- as.data.frame(cbind(Destination,Duration,Level,Board,Price))
My question is - how do I populate the 'Price' column of the price matrix with the corresponding prices that can be found in the lookup matrix? Please note that the duration variable of the price matrix will have to fit into a range found between the 'Min_Duration' and 'Max_Duration' columns in the lookup matrix.
In Excel I would use an Index,Match formula. But I'm stumped with R.
Thanks in advance,
Dan
Here is a tidyverse possibility
First, please note that I rename your input objects; both Price_matrix and Lookup_matrix are data.frames (not matrices).
df1 <- Price_matrix
df2 <- Lookup_matrix
Next we need to fix the column names of df2 = Lookup_matrix.
# Fix column names
colnames(df2) <- gsub("^_", "", apply(df2[1:2, ], 2, paste0, collapse = "_"))
df2 <- df2[-(1:2), ]
We now basically do a left join of df1 and df2. To get df2 into a suitable format, we reshape it from wide to long, extract the Price values for every Board and Level, and expand each row into one entry per duration from Min_Duration to Max_Duration. Then we join by Destination, Duration, Level and Board.
Note that in your example, Lookup_matrix has no Half-board Level_3 column (V9 is left out of the cbind call), so the Italy entry, which asks for Half-board at Level 3, gets Price = NA.
library(tidyverse)
left_join(
df1 %>%
mutate_if(is.factor, as.character) %>%
select(-Price),
df2 %>%
mutate_if(is.factor, as.character) %>%
gather(key, Price, -Destination, -Min_Duration, -Max_Duration) %>%
separate(key, into = c("Board", "Level"), sep = "_", extra = "merge") %>%
mutate(Level = sub("Level_", "", Level)) %>%
rowwise() %>%
mutate(Duration = list(seq(as.numeric(Min_Duration), as.numeric(Max_Duration)))) %>%
unnest(Duration) %>%
select(-Min_Duration, -Max_Duration) %>%
mutate(Duration = as.character(Duration)))
#Joining, by = c("Destination", "Duration", "Level", "Board")
# Destination Duration Level Board Price
#1 Spain 2 1 Half-board 119.010942545719
#2 Italy 4 3 Half-board <NA>
#3 Portugal 8 3 Full-board 764.536124917446
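The range condition itself can be sketched in base R once the lookup table is in long format. The tables below are hypothetical, with made-up prices and only one Board and Level, and `find_price` is a name introduced for illustration.

```r
# Hypothetical long-format lookup table (prices are made up)
lookup <- data.frame(
  Destination = rep(c("Spain", "Italy"), each = 2),
  Min_Duration = c(1, 3, 1, 3),
  Max_Duration = c(2, 5, 2, 5),
  Board = "Half-board",
  Level = 1,
  Price = c(100, 150, 120, 180),
  stringsAsFactors = FALSE
)

requests <- data.frame(
  Destination = c("Spain", "Italy", "Italy"),
  Duration = c(2, 4, 8),
  Board = "Half-board",
  Level = 1,
  stringsAsFactors = FALSE
)

# Return the first price whose keys match and whose duration range
# contains the requested duration, or NA when nothing matches
find_price <- function(dest, dur, board, level) {
  hit <- lookup$Destination == dest &
    lookup$Board == board &
    lookup$Level == level &
    lookup$Min_Duration <= dur &
    dur <= lookup$Max_Duration
  if (any(hit)) lookup$Price[which(hit)[1]] else NA_real_
}

requests$Price <- mapply(find_price, requests$Destination, requests$Duration,
                         requests$Board, requests$Level, USE.NAMES = FALSE)
requests
```

Unlike the seq-and-unnest approach, this also handles non-integer durations, although it scans the lookup table once per request.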
Using data.table:
library(data.table)
nms = trimws(do.call(paste, transpose(Lookup_matrix[1:2, ])))# column names
cat(do.call(paste, c(collapse="\n", Lookup_matrix[-(1:2), ])), file = "mm.csv")
# Rewrite the data in the correct format. You do not have to.
# Just doing Lookup_matrix1 = setNames(Lookup_matrix[-(1:2),],nms) is enough
# but it will not have rectified the column classes.
Lookup_matrix1 = fread("mm.csv", col.names = nms)
melt(Lookup_matrix1, 1:3)[,
c("Board", "Level") := .(sub("[.]", "-", sub("\\.Leve.*", "", variable)), sub("\\D+", "", variable))][
Price_matrix[, -5], on=c("Destination", "Board", "Level", "Min_Duration <= Duration", "Max_Duration >= Duration")]
Destination Min_Duration Max_Duration variable value Board Level
1: Spain 2 2 Half.board.Level_1 105.2304 Half-board 1
2: Italy 4 4 <NA> NA Half-board 3
3: Portugal 8 8 Full.board.Level_3 536.5132 Full-board 3

Making new column with multiple elements after group_by

I'm trying to make a new column as described below. The d's in V1 correspond to dates, and V2 holds the events on those dates. I need to collect the events for each date into V3, a single column whose row entries are concatenations. Thanks in advance. My attempt does not work.
df = V1 V2
d1 U
d2 M
d1 T
d1 Q
d2 P
desired resulting df
df.1 = V1 V3
d1 U,T,Q
d2 M,P
df.1 <- df %>% group_by(., V1) %>%
mutate(., V3 = c(distinct(., V2))) %>%
as.data.frame
The above code results in the following error; ignore the 15 and 1s--they're specific to my actual code
Error: incompatible size (15), expecting 1 (the group size) or 1
You can use aggregate like this:
df.1 <- aggregate(V2~V1,paste,collapse=",",data=df)
# V1 V2
# 1 d1 U,T,Q
# 2 d2 M,P
Inside mutate, each new column must have length 1 or the group size, so a vector of distinct values does not fit. Instead of using c(), you can use paste to collapse the elements into a single string.
df.1 <- df %>% group_by(V1) %>% mutate(V3 = paste(unique(V2), collapse = ",")) %>% select(V1, V3) %>% unique() %>% as.data.frame()
Still with dplyr, you can try:
df %>% group_by(V1) %>% summarize(V3 = paste(unique(V2), collapse = ","))
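For a self-contained check, the example data from the question can be written out and collapsed with base R's tapply. This is a sketch of the same paste-and-collapse idea, not a replacement for the answers above; the name `collapsed` is introduced here.

```r
# The example data from the question, written out explicitly
df <- data.frame(
  V1 = c("d1", "d2", "d1", "d1", "d2"),
  V2 = c("U", "M", "T", "Q", "P"),
  stringsAsFactors = FALSE
)

# For each date in V1, paste its unique events into one string
collapsed <- tapply(df$V2, df$V1,
                    function(x) paste(unique(x), collapse = ","))

# Rebuild a two-column data frame from the named result
df.1 <- data.frame(V1 = names(collapsed), V3 = as.vector(collapsed),
                   stringsAsFactors = FALSE)
df.1
```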
