I have a dataframe with a lot of columns that contain abbreviations. I'm trying to replace the abbreviations with their full names.
A minimal reproducible example:
category <- data.frame(short = c("TOM", "BAN", "APP", "PEA"),
name = c("tomato", "banana", "apple", "pear"))
df <- data.frame(col1 = c("TOM", "TOM", "TOM", "APP", "TOM"),
col2 = c("APP", "TOM", "TOM", "PEA", "PEA"),
col3 = c("TOM", "PEA", "PEA", "TOM", "BAN"))
col1 col2 col3
1 TOM APP TOM
2 TOM TOM PEA
3 TOM TOM PEA
4 APP PEA TOM
5 TOM PEA BAN
Now, I would like my dataframe to just contain the full names of the products. I can get it to work with left_joins, selecting and renaming, but this code is getting out of hand pretty rapidly with a lot of columns.
df2 <- df %>%
left_join(category, by = c("col1" = "short")) %>%
select(-col1) %>%
rename(col1 = name) %>%
left_join(category, by = c("col2" = "short")) %>%
select(-col2) %>%
rename(col2 = name) %>%
left_join(category, by = c("col3" = "short")) %>%
select(-col3) %>%
rename(col3 = name)
col1 col2 col3
1 tomato apple tomato
2 tomato tomato pear
3 tomato tomato pear
4 apple pear tomato
5 tomato pear banana
I think (hope?) there's a better solution for it, but I'm unable to find it.
An option is to create a named vector
library(dplyr)
library(tibble)
v1 <- deframe(category)
and then use that to match and replace the values
df1 <- df %>%
mutate(across(everything(), ~ v1[.]))
Output:
df1
# col1 col2 col3
#1 tomato apple tomato
#2 tomato tomato pear
#3 tomato tomato pear
#4 apple pear tomato
#5 tomato pear banana
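To see why the indexing step works, here is a minimal base R sketch. It uses setNames instead of deframe (my substitution, so that no package is needed), and "XXX" is a made-up code that is not in the lookup table:

```r
category <- data.frame(short = c("TOM", "BAN", "APP", "PEA"),
                       name = c("tomato", "banana", "apple", "pear"),
                       stringsAsFactors = FALSE)

# Named vector: names are the abbreviations, values are the full names
v1 <- setNames(category$name, category$short)

# "XXX" is a hypothetical code that is absent from the lookup table
codes <- c("TOM", "APP", "XXX")

# Indexing by name returns the matching value, or NA when nothing matches
full <- unname(v1[codes])
print(full)
```

Note that an abbreviation without a match comes back as NA rather than being left unchanged, so it may be worth running anyNA() on the result.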
It can also be done with recode in a similar way
df %>%
mutate(across(everything(), ~ recode(., !!! v1)))
Or using base R, create the named vector with setNames, loop over the columns with lapply and replace those values and assign it back
v1 <- with(category, setNames(name, short))
df1 <- df
df1[] <- lapply(df, function(x) v1[x])
Or convert to matrix (a matrix is a vector with dim attributes)
df1[] <- v1[as.matrix(df)]
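As a rough illustration of the matrix idea, here is a sketch on a reduced two-column data frame (the small df and v1 below are assumptions made up for the example, not the full data from the question). The lookup returns one flat vector in column-major order, and assigning it with empty brackets fills the data frame column by column:

```r
df <- data.frame(col1 = c("TOM", "APP"),
                 col2 = c("PEA", "TOM"),
                 stringsAsFactors = FALSE)
v1 <- c(TOM = "tomato", APP = "apple", PEA = "pear")

m <- as.matrix(df)        # character matrix, 2 rows x 2 columns
flat <- unname(v1[m])     # flat vector: col1 values first, then col2 values

out <- df
out[] <- flat             # empty brackets keep the data frame shape
print(out)
```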
Another option is using factor
df[] <- factor(
u <- unlist(df),
labels = with(category, name[match(sort(unique(u)), short)])
)
or a shorter one via setNames
df[] <- with(category, setNames(name, short))[unlist(df)]
which gives
> df
col1 col2 col3
1 tomato apple tomato
2 tomato tomato pear
3 tomato tomato pear
4 apple pear tomato
5 tomato pear banana
You can reshape the data to long format so that all the values are in one column, join that column with the category dataframe, and then reshape the data back to wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, names_to = 'col', values_to = 'short') %>%
left_join(category, 'short') %>%
select(-short) %>%
pivot_wider(names_from = col, values_from = name) %>%
select(-row)
# col1 col2 col3
# <chr> <chr> <chr>
#1 tomato apple tomato
#2 tomato tomato pear
#3 tomato tomato pear
#4 apple pear tomato
#5 tomato pear banana
Related
fruits <- c("apple", "orange", "pear")
df <- data.frame(string = c("appleorange",
"orangepear",
"applepear"))
Desired outcome:
string
appleorange
apple
orange
orangepear
orange
pear
applepear
apple
pear
Here is one approach using regex along with sub:
regex <- paste0("(?:", paste(fruits, collapse="|"), ")")
df$col1 <- sub(paste0(regex, "$"), "", df$string)
df$col2 <- sub(paste0("^", regex), "", df$string)
df
string col1 col2
1 appleorange apple orange
2 orangepear orange pear
3 applepear apple pear
Data:
fruits <- c("apple", "orange", "pear")
df <- data.frame(string = c("appleorange", "orangepear", "applepear"))
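To make the two anchored substitutions easier to follow, here is a small sketch on a single string. I pass perl = TRUE so that the non-capturing group is certainly supported; that flag is my addition and is not part of the answer above:

```r
fruits <- c("apple", "orange", "pear")

# Alternation of all fruit names inside a non-capturing group
regex <- paste0("(?:", paste(fruits, collapse = "|"), ")")
print(regex)

# Remove the fruit at the end of the string, which leaves the first fruit
first <- sub(paste0(regex, "$"), "", "appleorange", perl = TRUE)

# Remove the fruit at the start of the string, which leaves the second fruit
second <- sub(paste0("^", regex), "", "appleorange", perl = TRUE)

print(c(first, second))
```

This relies on every string being exactly two fruit names stuck together; with three or more the two columns would overlap.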
Here is a solution using stringr package:
library(dplyr)
library(stringr)
df %>%
mutate(col1 = str_extract(string, paste(fruits, collapse = '|')),
col2 = str_replace(string, col1, ''))
string col1 col2
1 appleorange apple orange
2 orangepear orange pear
3 applepear apple pear
Using separate
library(dplyr)
library(stringr)
library(tidyr)
separate(df, string, into = c("col1", "col2"),
sep = glue::glue("(?<=[a-z])(?={str_c(fruits, collapse='|')})"), remove = FALSE)
string col1 col2
1 appleorange apple orange
2 orangepear orange pear
3 applepear apple pear
How do you replace values in a column when the value fulfils certain conditions in R?
Here I have two data frames.
Fruits <- c("Apple", "Grape Fruits", "Lemon", "Peach", "Banana", "Orange", "Strawberry", "Apple")
df1 <- data.frame(Fruits)
df1
Fruits
Apple
Grape Fruits
Lemon
Peach
Banana
Orange
Strawberry
Apple
Name <- c("Apple", "Orange", "Lemon", "Grape", "Peach","Pinapple")
Rename <- c("Manzana", "Naranja", "Limon", "Uva", "Melocoton", "Anana")
df2 <- data.frame(Name, Rename)
df2
Name Rename
Apple Manzana
Orange Naranja
Lemon Limon
Grape Uva
Peach Melocoton
Pinapple Anana
I want to replace the values in df1$Fruits with the corresponding values in df2$Rename, but only where a value in df1$Fruits matches one in df2$Name.
So the desired data frame would look like this.
Fruits
Manzana
Grape Fruits
Limon
Melocoton
Banana
Naranja
Strawberry
Manzana
Does anybody know how to do this? Thank you very much for your help.
Using plyr:
library(plyr)
new.fruits <- mapvalues(Fruits, from = Name, to = Rename)
df <- data.frame(Fruits=new.fruits)
You can use merge and then replace all NA by their respective fruits.
df3 <- merge(df1,df2, by.x = "Fruits", by.y = "Name", all.x = T)
df3$Rename[is.na(df3$Rename)] <- df3$Fruits[is.na(df3$Rename)]
If you need to keep the order:
df1$id <- 1:nrow(df1)
df3 <- merge(df1,df2, by.x = "Fruits", by.y = "Name", all.x = T)
df3$Rename[is.na(df3$Rename)] <- df3$Fruits[is.na(df3$Rename)]
df3 <- df3[order(df3$id),]
data.frame(Fruits = df3[,"Rename"])
# Fruits
# 1 Manzana
# 2 Grape Fruits
# 3 Limon
# 4 Melocoton
# 5 Banana
# 6 Naranja
# 7 Strawberry
# 8 Manzana
Shorter match solution from @Wen below
df1$new=df2$Rename[match(df1$Fruits,df2$Name)]
df1$new[is.na(df1$new)] <- df1$Fruits[is.na(df1$new)]
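The same match-then-fall-back logic can be sketched on plain vectors, which may make the two steps easier to see. The shortened vectors below are taken from the question's data; the variable names idx and out are just for illustration:

```r
fruits <- c("Apple", "Grape Fruits", "Lemon")
name   <- c("Apple", "Orange", "Lemon")
rename <- c("Manzana", "Naranja", "Limon")

# Position of each fruit in the lookup names, NA where there is no match
idx <- match(fruits, name)

# Use the translation where a match exists, otherwise keep the original value
out <- ifelse(is.na(idx), fruits, rename[idx])
print(out)
```

If the columns are factors (the default in R before 4.0), wrapping them in as.character() first is probably the safer route.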
Using apply with an equality test against df2$Name can provide the desired output.
df1$Fruits <- apply(df1,1,function(x){
matched = (df2$Name == x)
if(any(matched)){
as.character(df2$Rename[matched])
} else {
x
}})
df1
# Fruits
# 1 Manzana
# 2 Grape Fruits
# 3 Limon
# 4 Melocoton
# 5 Banana
# 6 Naranja
# 7 Strawberry
# 8 Manzana
I am trying to separate a couple of user-filled variables into multiple columns. I have tried to use the spread function, but I am running into some problems. For example, the database looks like this:
SubjID Input1 Input2
1 Banana NA
2 Apple NA
3 NA Banana
4 Apple Banana
And I am trying to get it to look like this:
SubjID Input1 Input2 Banana Apple
1 Banana NA Banana NA
2 Apple NA NA Apple
3 NA Banana Banana NA
4 Apple Banana Banana Apple
I can use the spread function in tidyr to separate Input1, but the problem comes with Input2. I am able to spread it, but I can't put the values into the previously created Banana column; it instead creates two Banana columns, which I cannot figure out how to merge correctly. Is there any way to have the values sort into the columns correctly? I am new to R and am having a lot of trouble with this aspect of the database. There are too many options for me to list Banana and Apple explicitly, and I am really unsure of how to do this.
We may need to gather first before doing the spread
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, -SubjID, na.rm = TRUE) %>%
mutate(key1 = val) %>%
select(-key) %>% spread(key1, val) %>%
left_join(df1, ., by = 'SubjID')
# SubjID Input1 Input2 Apple Banana
#1 1 Banana <NA> <NA> Banana
#2 2 Apple <NA> Apple <NA>
#3 3 <NA> Banana <NA> Banana
#4 4 Apple Banana Apple Banana
data
df1 <- structure(list(SubjID = 1:4, Input1 = c("Banana", "Apple", NA,
"Apple"), Input2 = c(NA, NA, "Banana", "Banana")), .Names = c("SubjID",
"Input1", "Input2"), class = "data.frame", row.names = c(NA,
-4L))
Try this: Assuming that your data.frame is called dat:
dat$Banana <- ifelse(dat$Input1 == "Banana" | dat$Input2 == "Banana", "Banana", NA)
dat$Apple <- ifelse(dat$Input1 == "Apple" | dat$Input2 == "Apple", "Apple", NA)
For example, the first line checks row-by-row if either dat$Input1 or dat$Input2 is "Banana"; if so, it puts "Banana" in the Banana column, else it puts NA.
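Since the question mentions that there are too many options to write out by hand, the same idea could be put in a loop over the distinct values. This is only a sketch, assuming the data frame is called dat as above and that every distinct value makes an acceptable column name; I use %in% rather than == so that NA inputs count as "no match":

```r
dat <- data.frame(SubjID = 1:4,
                  Input1 = c("Banana", "Apple", NA, "Apple"),
                  Input2 = c(NA, NA, "Banana", "Banana"),
                  stringsAsFactors = FALSE)

# Distinct non-missing values across both input columns
values <- unique(c(dat$Input1, dat$Input2))
values <- sort(values[!is.na(values)])

# One new column per value: the value itself where it occurs, NA elsewhere
for (v in values) {
  hit <- (dat$Input1 %in% v) | (dat$Input2 %in% v)
  dat[[v]] <- ifelse(hit, v, NA)
}
print(dat)
```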
I have a dataframe that looks like this (I simplify):
df <- data.frame(rbind(c(1, "dog", "cat", "rabbit"), c(2, "apple", "peach", "cucumber")))
colnames(df) <- c("ID", "V1", "V2", "V3")
## ID V1 V2 V3
## 1 1 dog cat rabbit
## 2 2 apple peach cucumber
I would like to create a column containing all possible combinations of variables V1:V3 two by two (order doesn't matter), but keeping a link with the original ID. So something like this.
## ID bigrams
## 1 1 dog cat
## 2 1 cat rabbit
## 3 1 dog rabbit
## 4 2 apple peach
## 5 2 apple cucumber
## 6 2 peach cucumber
My idea: use combn(), mutate() and separate_rows().
library(tidyr)
library(dplyr)
df %>%
mutate(bigrams=paste(unlist(t(combn(df[,2:4],2))), collapse="-")) %>%
separate_rows(bigrams, sep="-") %>%
select(ID,bigrams)
The result is not what I expected... I guess that concatenating a matrix (the result of combn()) is not as easy as that.
I have two questions about this: 1) How do I debug this code? 2) Is this a good way to do this kind of thing? I'm new to R but I’ve an Open Refine background, so concatenating and splitting multivalued cells makes a lot of sense to me. But is this also the right method in R?
Thanks in advance for any help.
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), melt it to 'long' format, grouped by 'ID', get the combn of 'value' and paste it together
library(data.table)
dM <- melt(setDT(df), id.var = "ID")[, combn(value, 2, FUN = paste, collapse=' '), ID]
setnames(dM, 2, 'bigrams')[]
# ID bigrams
#1: 1 dog cat
#2: 1 dog rabbit
#3: 1 cat rabbit
#4: 2 apple peach
#5: 2 apple cucumber
#6: 2 peach cucumber
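On the debugging question: as far as I can tell, combn(df[,2:4], 2) in the question treats the three columns as the items to combine, rather than the three values within each row. A base R sketch of the per-row version, under the assumption that the columns are plain character vectors (the names pairs and result are only for illustration):

```r
df <- data.frame(ID = c(1, 2),
                 V1 = c("dog", "apple"),
                 V2 = c("cat", "peach"),
                 V3 = c("rabbit", "cucumber"),
                 stringsAsFactors = FALSE)

# For each row, paste together every pair of the three values
pairs <- lapply(seq_len(nrow(df)), function(i) {
  vals <- unlist(df[i, c("V1", "V2", "V3")], use.names = FALSE)
  data.frame(ID = df$ID[i],
             bigrams = combn(vals, 2, FUN = paste, collapse = " "),
             stringsAsFactors = FALSE)
})

# Stack the per-row results into one long data frame
result <- do.call(rbind, pairs)
print(result)
```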
I recommend @akrun's "melt first" approach, but just for fun, here are more ways to do it:
library(tidyverse)
df %>%
mutate_all(as.character) %>%
transmute(ID = ID, bigrams = pmap(
list(V1, V2, V3),
function(a, b, c) combn(c(a, b, c), 2, paste, collapse = " ")
))
# ID bigrams
# 1 1 dog cat, dog rabbit, cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber
(mutate_all(as.character) just because you gave us factors, and factor to character conversion can be surprising).
df %>%
mutate_all(as.character) %>%
nest(data = -ID) %>%
mutate(bigrams = map(data, combn, 2, paste, collapse = " ")) %>%
unnest(data) %>%
as.data.frame()
# ID bigrams V1 V2 V3
# 1 1 dog cat, dog rabbit, cat rabbit dog cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber apple peach cucumber
(as.data.frame() just for a prettier printing)
I need to match the values in col1 with those in col2 and col3, and if they match I need to add up their frequencies. The result should display, for each unique value, the total count from freq1, freq2 and freq3.
col1 freq1 col2 freq2 col3 freq3
apple 3 grapes 4 apple 1
grapes 5 apple 2 orange 2
orange 4 banana 5 grapes 2
guava 3 orange 6 banana 7
I need my output like this
apple 6
grapes 11
orange 12
guava 3
banana 12
I'm a beginner. How do I code this in R?
We can use melt from data.table with patterns specified in the measure argument to convert the 'wide' format to 'long' format, then grouped by 'col', we get the sum of 'freq' column
library(data.table)
melt(setDT(df1), measure = patterns("^col", "^freq"),
value.name = c("col", "freq"))[,.(freq = sum(freq)) , by = col]
# col freq
#1: apple 6
#2: grapes 11
#3: orange 12
#4: guava 3
#5: banana 12
If it is alternating 'col', 'freq', columns, we can just unlist the subset of 'col' columns and 'freq' columns separately to create a data.frame (using c(TRUE, FALSE) to recycle for subsetting columns), and then use aggregate from base R to get the sum grouped by 'col'.
aggregate(freq~col, data.frame(col = unlist(df1[c(TRUE, FALSE)]),
freq = unlist(df1[c(FALSE, TRUE)])), sum)
# col freq
#1 apple 6
#2 banana 12
#3 grapes 11
#4 guava 3
#5 orange 12
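Since the question did not include the data as code, here is a self-contained sketch of the alternating-column idea. The data frame is rebuilt from the table in the question; the intermediate name long is just for illustration:

```r
df1 <- data.frame(col1 = c("apple", "grapes", "orange", "guava"),
                  freq1 = c(3, 5, 4, 3),
                  col2 = c("grapes", "apple", "banana", "orange"),
                  freq2 = c(4, 2, 5, 6),
                  col3 = c("apple", "orange", "grapes", "banana"),
                  freq3 = c(1, 2, 2, 7),
                  stringsAsFactors = FALSE)

# c(TRUE, FALSE) is recycled across the six columns: it keeps 1, 3, 5
label_cols <- names(df1)[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps columns 2, 4, 6
freq_cols <- names(df1)[c(FALSE, TRUE)]

# Stack the label columns and the frequency columns in the same order
long <- data.frame(col = unlist(df1[label_cols], use.names = FALSE),
                   freq = unlist(df1[freq_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Sum the frequencies per label
res <- aggregate(freq ~ col, long, sum)
print(res)
```

This only works when the columns really do alternate in that order; otherwise selecting by name pattern, as in the melt version, is probably more robust.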
I think the easiest approach for a beginner to understand would be to create 3 separate dataframes (I assumed here that your dataframe is named df):
df1 <- data.frame(df$col1, df$freq1)
colnames(df1) <- c("fruit", "freq")
df2 <- data.frame(df$col2, df$freq2)
colnames(df2) <- c("fruit", "freq")
df3 <- data.frame(df$col3, df$freq3)
colnames(df3) <- c("fruit", "freq")
Then bind all dataframes by rows:
df <- rbind(df1, df2, df3)
And at the end group by fruit and sum frequencies using dplyr library.
library(dplyr)
df <- df %>%
group_by(fruit)%>%
summarise(freq = sum(freq))