Spreading user reported variables to multiple columns - r

I am trying to separate a couple of user-filled variables into multiple columns. I have tried to use the spread function, but I am running into some problems. For example, the database looks like this:
SubjID Input1 Input2
1 Banana NA
2 Apple NA
3 NA Banana
4 Apple Banana
And I am trying to get it to look like this:
SubjID Input1 Input2 Banana Apple
1 Banana NA Banana NA
2 Apple NA NA Apple
3 NA Banana Banana NA
4 Apple Banana Banana Apple
I can use the spread function in tidyr to separate Input1, but the problem comes with Input2. I am able to spread it, but I can't put the values into the previously created Banana column; it instead creates two Banana columns, which I cannot figure out how to merge correctly. Is there any way to have it sort into the columns correctly? I am new to R and am having a lot of trouble with this aspect of the database. There are too many options for me to list banana and apple explicitly, and I am really unsure of how to do this.

We may need to gather first before doing the spread
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, -SubjID, na.rm = TRUE) %>%
mutate(key1 = val) %>%
select(-key) %>% spread(key1, val) %>%
left_join(df1, ., by = 'SubjID')
# SubjID Input1 Input2 Apple Banana
#1 1 Banana <NA> <NA> Banana
#2 2 Apple <NA> Apple <NA>
#3 3 <NA> Banana <NA> Banana
#4 4 Apple Banana Apple Banana
data
df1 <- structure(list(SubjID = 1:4, Input1 = c("Banana", "Apple", NA,
"Apple"), Input2 = c(NA, NA, "Banana", "Banana")), .Names = c("SubjID",
"Input1", "Input2"), class = "data.frame", row.names = c(NA,
-4L))
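In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same idea with the newer verbs follows. It is an adaptation, not the answer's original code, and the distinct() step is an added assumption to guard against a subject entering the same value in both inputs.
```r
library(dplyr)
library(tidyr)

# Same example data as above, built with data.frame() for readability
df1 <- data.frame(
  SubjID = 1:4,
  Input1 = c("Banana", "Apple", NA, "Apple"),
  Input2 = c(NA, NA, "Banana", "Banana"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # stack Input1/Input2 into one value column, dropping missing answers
  pivot_longer(-SubjID, names_to = "key", values_to = "val",
               values_drop_na = TRUE) %>%
  select(-key) %>%
  # assumption: a repeated answer for one subject should count once
  distinct() %>%
  # use each answer both as the new column name and as the cell value
  mutate(col = val) %>%
  pivot_wider(names_from = col, values_from = val)

# attach the new columns to the original inputs
result <- left_join(df1, wide, by = "SubjID")
```
Here result keeps Input1 and Input2 and gains one column per distinct answer, with NA wherever a subject did not give that answer.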

Try this, assuming that your data.frame is called dat:
dat$Banana <- ifelse(dat$Input1 == "Banana" | dat$Input2 == "Banana", "Banana", NA)
dat$Apple <- ifelse(dat$Input1 == "Apple" | dat$Input2 == "Apple", "Apple", NA)
For example, the first line checks row-by-row if either dat$Input1 or dat$Input2 is "Banana"; if so, it puts "Banana" in the Banana column, else it puts NA.
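Since the question mentions too many options to write out one by one, the same check can be looped over whatever values actually occur. This is a rough base R sketch under that assumption; input_cols, vals, fruits and hit are illustrative names, not part of the answer above.
```r
dat <- data.frame(
  SubjID = 1:4,
  Input1 = c("Banana", "Apple", NA, "Apple"),
  Input2 = c(NA, NA, "Banana", "Banana"),
  stringsAsFactors = FALSE
)

input_cols <- c("Input1", "Input2")

# every distinct non-missing answer found in the input columns
vals <- unlist(dat[input_cols], use.names = FALSE)
fruits <- unique(vals[!is.na(vals)])

for (f in fruits) {
  # TRUE for rows where any input column equals this answer (NA counts as no match)
  hit <- rowSums(dat[input_cols] == f, na.rm = TRUE) > 0
  dat[[f]] <- ifelse(hit, f, NA_character_)
}
```
Using rowSums(..., na.rm = TRUE) keeps the NA handling explicit instead of relying on how | and ifelse() propagate missing values.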

Related

Joining and replacing columns multiple times

I have a dataframe with a lot of columns with abbreviations. I'm trying to replace the columns with their full name.
A minimal reproducible example:
category <- data.frame(short = c("TOM", "BAN", "APP", "PEA"),
name = c("tomato", "banana", "apple", "pear"))
df <- data.frame(col1 = c("TOM", "TOM", "TOM", "APP", "TOM"),
col2 = c("APP", "TOM", "TOM", "PEA", "PEA"),
col3 = c("TOM", "PEA", "PEA", "TOM", "BAN"))
col1 col2 col3
1 TOM APP TOM
2 TOM TOM PEA
3 TOM TOM PEA
4 APP PEA TOM
5 TOM PEA BAN
Now, I would like my dataframe to just contain the full names of the products. I can get it to work with left_joins, selecting and renaming, but this code is getting out of hand pretty rapidly with a lot of columns.
df2 <- df %>%
left_join(category, by = c("col1" = "short")) %>%
select(-col1) %>%
rename(col1 = name) %>%
left_join(category, by = c("col2" = "short")) %>%
select(-col2) %>%
rename(col2 = name) %>%
left_join(category, by = c("col3" = "short")) %>%
select(-col3) %>%
rename(col3 = name)
col1 col2 col3
1 tomato apple tomato
2 tomato tomato pear
3 tomato tomato pear
4 apple pear tomato
5 tomato pear banana
I think (hope?) there's a better solution for it, but I'm unable to find it.
An option is to create a named vector
library(dplyr)
library(tibble)
v1 <- deframe(category)
and then use that to match and replace the values
df1 <- df %>%
mutate(across(everything(), ~ v1[.]))
-output
df1
# col1 col2 col3
#1 tomato apple tomato
#2 tomato tomato pear
#3 tomato tomato pear
#4 apple pear tomato
#5 tomato pear banana
It can also be done with recode in a similar way
df %>%
mutate(across(everything(), ~ recode(., !!! v1)))
Or using base R, create the named vector with setNames, loop over the columns with lapply and replace those values and assign it back
v1 <- with(category, setNames(name, short))
df1 <- df
df1[] <- lapply(df, function(x) v1[x])
Or convert to matrix (a matrix is a vector with dim attributes)
df1[] <- v1[as.matrix(df)]
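One thing to keep in mind with named-vector lookup: a code that has no entry in the vector comes back as NA. The sketch below shows that behaviour and one possible fallback that keeps the original code. "KIW" is a made-up abbreviation used only for illustration. If the columns are factors (the data.frame() default before R 4.0), convert them with as.character() first, because indexing by a factor uses its integer codes rather than its labels.
```r
category <- data.frame(
  short = c("TOM", "BAN", "APP", "PEA"),
  name  = c("tomato", "banana", "apple", "pear"),
  stringsAsFactors = FALSE
)

# named vector: names are the abbreviations, values are the full names
v1 <- setNames(category$name, category$short)

# "KIW" is hypothetical and has no entry in category
codes <- c("TOM", "KIW", "PEA")

# plain lookup: unmatched codes become NA
looked_up <- unname(v1[codes])

# fallback: keep the original code where the lookup found nothing
kept <- ifelse(is.na(looked_up), codes, looked_up)
```
Whether NA or the original abbreviation is the better result depends on how unknown codes should be treated downstream.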
Another option is using factor
df[] <- factor(
u <- unlist(df),
labels = with(category, name[match(sort(unique(u)), short)])
)
or a shorter one via setNames
df[]<-with(category,setNames(name,short))[unlist(df)]
which gives
> df
col1 col2 col3
1 tomato apple tomato
2 tomato tomato pear
3 tomato tomato pear
4 apple pear tomato
5 tomato pear banana
You can get the data in long format such that all the values are in one column which is easy to join with category dataframe and then get data back in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, names_to = 'col', values_to = 'short') %>%
left_join(category, 'short') %>%
select(-short) %>%
pivot_wider(names_from = col, values_from = name) %>%
select(-row)
# col1 col2 col3
# <chr> <chr> <chr>
#1 tomato apple tomato
#2 tomato tomato pear
#3 tomato tomato pear
#4 apple pear tomato
#5 tomato pear banana

Transforming a list of lists into dataframe

I have a list containing a number of other lists, each of which contains a varying number of character vectors, with varying numbers of elements. I want to create a dataframe where each list would be represented as a row and each character vector within that list would be a column. Where the character vector has > 1 element, the elements would be concatenated and separated using a "+" sign, so that they can be stored as one string. The data looks like this:
fruits <- list(
list(c("orange"), c("pear")),
list(c("pear", "orange")),
list(c("lemon", "apple"),
c("pear"),
c("grape"),
c("apple"))
)
The expected output is like this:
fruits_df <- data.frame(col1 = c("orange", "pear + orange", "lemon + apple"),
col2 = c("pear", NA, "pear"),
col3 = c(NA, NA, "grape"),
col4 = c(NA, NA, "apple"))
There is no limit on the number of character vectors that can be contained in a list, so the solution needs to dynamically create columns, leading to a df where the number of columns is equal to the length of the list containing the largest number of character vectors.
For every list in fruits you can create a one row dataframe and bind the data.
dplyr::bind_rows(lapply(fruits, function(x) as.data.frame(t(sapply(x,
function(y) paste0(y, collapse = "+"))))))
# V1 V2 V3 V4
#1 orange pear <NA> <NA>
#2 pear+orange <NA> <NA> <NA>
#3 lemon+apple pear grape apple
This is a bit messy but here is one way
cols <- lapply(fruits, function(x) sapply(x, paste, collapse=" + "))
ncols <- max(lengths(cols))
dd <- do.call("rbind.data.frame", lapply(cols, function(x) {length(x) <- ncols; x}))
names(dd) <- paste0("col", 1:ncol(dd))
dd
# col1 col2 col3 col4
# 1 orange pear <NA> <NA>
# 2 pear + orange <NA> <NA> <NA>
# 3 lemon + apple pear grape apple
or another strategy
ncols <- max(lengths(fruits))
dd <- data.frame(lapply(seq.int(ncols), function(x) sapply(fruits, function(y) paste(unlist(y[x]), collapse=" + "))))
names(dd) <- paste0("col", 1:ncols)
dd
But really you need to either build each column or row from your list and then combine them together.
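As a sketch of the row-wise variant of that idea: collapse each inner vector to one string, pad every row to the same length with NA, then stack the rows into a character matrix. The names rows, padded and m are illustrative only.
```r
fruits <- list(
  list(c("orange"), c("pear")),
  list(c("pear", "orange")),
  list(c("lemon", "apple"), c("pear"), c("grape"), c("apple"))
)

# one character vector per row, with multi-element vectors collapsed
rows <- lapply(fruits, function(x) vapply(x, paste, character(1), collapse = " + "))

# widest row decides the number of columns
ncols <- max(lengths(rows))

# extending a vector with `length<-` pads it with NA
padded <- lapply(rows, function(r) { length(r) <- ncols; r })

# stack rows into a matrix, then convert to a data.frame
m <- do.call(rbind, padded)
fruits_df <- as.data.frame(m, stringsAsFactors = FALSE)
names(fruits_df) <- paste0("col", seq_len(ncols))
```
Padding with length<- gives real NA values in the short rows, which matches the expected output in the question.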
Another approach that melts the list to a data.frame using rrapply::rrapply and then casts it to the required format using data.table::dcast:
library(rrapply)
library(data.table)
## melt to long data.frame
long <- rrapply(fruits, f = paste, how = "melt", collapse = " + ")
## cast to wide data.table
setDT(long)
dcast(long[, .(L1, L2, value = unlist(value))], L1 ~ L2)[, !"L1"]
#> ..1 ..2 ..3 ..4
#> 1: orange pear <NA> <NA>
#> 2: pear + orange <NA> <NA> <NA>
#> 3: lemon + apple pear grape apple

Group columns and store variables in list within dataset in R [duplicate]

So basically, I have a dataframe like this
fruit vendor size
banana Walmart M
banana Sears L
apple Popeye's XL
orange Footlocker S
apple Popeye's W
banana Walmart L
And I need it to look like this (I tried a group_by, but I don't know how to group sizes as a list in a row):
fruit vendor size
banana Walmart c("M", "L")
banana Sears L
apple Popeye's c("XL","W")
orange Footlocker S
#Tried this
df %>%
group_by(fruit, vendor) %>%
##now what?
Then later on I would like to choose from the list on an ifelse.
inc_list <- c("XS","S")
minc_list <- c("M","L", "W", "XL")
df$counter <- ifelse(unlist(df$size) %in% inc_list , 1, 0)
This doesn't work, but here is what I want it to look like: if a size appears in inc_list, count in counter1 how many there are. Ditto for minc_list and counter2, which counts how many of the sizes are in that list.
fruit vendor size counter1 counter2
banana Walmart c("M", "L") 1 1
banana Sears L 0 1
apple Popeye's c("XL","W") 0 2
orange Footlocker S 1 0
EDIT: Last bit, c("S","S") would only be 1, duplicates from the same list shouldn't count.
You can combine size into a list in summarise :
library(dplyr)
df %>% group_by(fruit, vendor) %>% summarise(size = list(size))
# fruit vendor size
# <chr> <chr> <list>
#1 apple Popeyes <chr [2]>
#2 banana Sears <chr [1]>
#3 banana Walmart <chr [2]>
#4 orange Footlocker <chr [1]>
You can also do this in base R :
aggregate(size~fruit+vendor, df, list)
Or data.table :
library(data.table)
setDT(df)[, .(size = list(size)), .(fruit, vendor)]
data
df <- structure(list(fruit = c("banana", "banana", "apple", "orange",
"apple", "banana"), vendor = c("Walmart", "Sears", "Popeyes",
"Footlocker", "Popeyes", "Walmart"), size = c("M", "L", "XL",
"S", "W", "L")), class = "data.frame", row.names = c(NA, -6L))
We can also convert to string with toString
aggregate(size~fruit+vendor, df, toString)
# fruit vendor size
#1 orange Footlocker S
#2 apple Popeyes XL, W
#3 banana Sears L
#4 banana Walmart M, L
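The answers above stop at building the list column. For the counters, one possible reading of the question is "count the distinct sizes of each group that fall in each list". The sketch below implements that reading with dplyr. It is an assumption about the intent, and it gives 0 and 2 for banana/Walmart, which differs from the 1 and 1 shown in the question's table.
```r
library(dplyr)

df <- data.frame(
  fruit  = c("banana", "banana", "apple", "orange", "apple", "banana"),
  vendor = c("Walmart", "Sears", "Popeyes", "Footlocker", "Popeyes", "Walmart"),
  size   = c("M", "L", "XL", "S", "W", "L"),
  stringsAsFactors = FALSE
)

inc_list  <- c("XS", "S")
minc_list <- c("M", "L", "W", "XL")

out <- df %>%
  group_by(fruit, vendor) %>%
  # unique() so that duplicates such as c("S", "S") count once
  summarise(size = list(unique(size)), .groups = "drop") %>%
  mutate(
    # number of distinct sizes in each group found in each list
    counter1 = vapply(size, function(s) sum(s %in% inc_list), integer(1)),
    counter2 = vapply(size, function(s) sum(s %in% minc_list), integer(1))
  )
```
The vapply() calls walk the list column one group at a time, which avoids the unlist() step in the question that loses the row structure.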

Create variables based on regular expressions with a loop in r

I need help to create variables based on regular expressions.
This is my dataframe:
df <- data.frame(a=c("blue", "red", "yellow", "yellow", "yellow", "yellow", "red"), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato", NA))
Basically, what I want to do is this, but in one step:
regx_1 <- as.numeric(grep("^[a-z]{5}$", df$b))
regx_2 <- as.numeric(grep("^[a-z]{6,}$", df$b))
df$fruit_1 <- NA
df$fruit_1[regx_1 + 1] <- as.character(df$b[regx_1])
df$fruit_2 <- NA
df$fruit_2[regx_2 + 1] <- as.character(df$b[regx_2])
Here is my try:
regex1 <- "^[a-z]{5}$"
regex2 <- "^[a-z]{6,}$"
regex <- c(regex1, regex1)
make_non_matches_NA <- function(vec, pattern){
df[[newvariable]] <- NA
df[[newvariable]][as.numeric(grep(pattern, vec)) + 1] <- as.character(vec[as.numeric(grep(pattern, vec))])
return(newvariable)
}
df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)
EDIT: Why is my approach wrong? (Please note that the actual problem is bigger, so I have to stick to an approach, where a repetition of a pattern should be avoided)
Any help is much appreciated!
Having numbered items in your workspace is a good sign that they really belong in a list, so that they are formally linked and we can work with them much more easily. So let's do that first.
regex <- c("^[a-z]{5}$", "^[a-z]{6,}$")
Our core functionality is to copy a source vector, but remove elements that don't match and leave NA in their place. So we'll make a function for that, and we'll name it explicitly so we understand intuitively what it's doing (as will our colleagues, or the next reader on SO ;) ):
make_non_matches_NA <- function(vec, pattern){
# logical indices of matches
matches_lgl <- grepl(pattern, vec)
# the elements which don't match should be NA
vec[!matches_lgl] <- NA
# resulting vector should be returned
vec
}
Let's test this with first pattern
make_non_matches_NA(df$b, regex[[1]])
#> [1] apple <NA> peach lemon <NA> <NA>
#> Levels: apple lemon orange peach pineapple tomato
So far so good! Now let's test it with all the regexes. In R we generally avoid for loops when we can, because we have clearer tools like lapply(). Here I want to apply this function to all the regular expressions:
lapply(regex, make_non_matches_NA, vec = df$b)
#> [[1]]
#> [1] apple <NA> peach lemon <NA> <NA>
#> Levels: apple lemon orange peach pineapple tomato
#>
#> [[2]]
#> [1] <NA> orange <NA> <NA> pineapple tomato
#> Levels: apple lemon orange peach pineapple tomato
Great, it works!
But I want this in my data.frame, not as a separate list, so I will assign this result to the relevant names in my df directly
df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)
# then print my updated df
df
#> a b fruit1 fruit2
#> 1 1 apple apple <NA>
#> 2 2 orange <NA> orange
#> 3 3 peach peach <NA>
#> 4 4 lemon lemon <NA>
#> 5 5 pineapple <NA> pineapple
#> 6 6 tomato <NA> tomato
tada!
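A small variation, sketched here as an assumption rather than as part of the answer: if the patterns are stored in a named vector, the new column names can come from the names, so nothing has to be typed twice when more patterns are added. The six-row df with stringsAsFactors = FALSE is illustrative data.
```r
df <- data.frame(
  a = 1:6,
  b = c("apple", "orange", "peach", "lemon", "pineapple", "tomato"),
  stringsAsFactors = FALSE
)

# names become the new column names, values are the patterns
patterns <- c(fruit1 = "^[a-z]{5}$", fruit2 = "^[a-z]{6,}$")

make_non_matches_NA <- function(vec, pattern) {
  # keep matching elements, replace everything else with NA
  vec[!grepl(pattern, vec)] <- NA
  vec
}

# lapply() keeps the names, so they can be reused for the assignment
df[names(patterns)] <- lapply(patterns, make_non_matches_NA, vec = df$b)
```
With character columns instead of factors, the result holds plain strings and no unused factor levels.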
I don't know if this qualifies as "at one step" but you could try mutate from the dplyr package:
df <- data.frame(a=c(1:6), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato"),
stringsAsFactors = FALSE)
Note that I set stringsAsFactors = FALSE inside data.frame().
dplyr::mutate(df, fruit_1 = if_else(grepl("^[a-z]{5}$", b), b, NA_character_),
fruit_2 = if_else(grepl("^[a-z]{6,}$", b), b, NA_character_))
a b fruit_1 fruit_2
1 1 apple apple <NA>
2 2 orange <NA> orange
3 3 peach peach <NA>
4 4 lemon lemon <NA>
5 5 pineapple <NA> pineapple
6 6 tomato <NA> tomato

R: Replacing values in a column with corresponding values

How do you replace values in a column when the value fulfils certain conditions in R?
Here I have two data frames.
Fruits <- c("Apple", "Grape Fruits", "Lemon", "Peach", "Banana", "Orange", "Strawberry", "Apple")
df1 <- data.frame(Fruits)
df1
Fruits
Apple
Grape Fruits
Lemon
Peach
Banana
Orange
Strawberry
Apple
Name <- c("Apple", "Orange", "Lemon", "Grape", "Peach","Pinapple")
Rename <- c("Manzana", "Naranja", "Limon", "Uva", "Melocoton", "Anana")
df2 <- data.frame(Name, Rename)
df2
Name Rename
Apple Manzana
Orange Naranja
Lemon Limon
Grape Uva
Peach Melocoton
Pinapple Anana
I want to replace the values in df1$Fruits to corresponding values in df2$Rename, only when each value in df1$Fruits matches that in df2$Name.
So the designated data frame would be like this.
Fruits
Manzana
Grape Fruits
Limon
Melocoton
Banana
Naranja
Strawberry
Manzana
Does anybody know how to do this? Thank you very much for your help.
Using plyr:
library(plyr)
new.fruits <- mapvalues(Fruits, from = Name, to = Rename)
df <- data.frame(Fruits=new.fruits)
You can use merge and then replace all NA by their respective fruits.
df3 <- merge(df1,df2, by.x = "Fruits", by.y = "Name", all.x = T)
df3$Rename[is.na(df3$Rename)] <- df3$Fruits[is.na(df3$Rename)]
If you need to keep the order:
df1$id <- 1:nrow(df1)
df3 <- merge(df1,df2, by.x = "Fruits", by.y = "Name", all.x = T)
df3$Rename[is.na(df3$Rename)] <- df3$Fruits[is.na(df3$Rename)]
df3 <- df3[order(df3$id),]
data.frame(Fruits = df3[,"Rename"])
# Fruits
# 1 Manzana
# 2 Grape Fruits
# 3 Limon
# 4 Melocoton
# 5 Banana
# 6 Naranja
# 7 Strawberry
# 8 Manzana
Shorter match solution from @Wen below
df1$new=df2$Rename[match(df1$Fruits,df2$Name)]
df1$new[is.na(df1$new)] <- df1$Fruits[is.na(df1$new)]
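The match() idea can also overwrite Fruits directly. The sketch below assumes both data frames hold character columns (stringsAsFactors = FALSE); with factor columns the replacement values would need as.character() first. idx is an illustrative name.
```r
df1 <- data.frame(
  Fruits = c("Apple", "Grape Fruits", "Lemon", "Peach",
             "Banana", "Orange", "Strawberry", "Apple"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name   = c("Apple", "Orange", "Lemon", "Grape", "Peach", "Pinapple"),
  Rename = c("Manzana", "Naranja", "Limon", "Uva", "Melocoton", "Anana"),
  stringsAsFactors = FALSE
)

# position of each fruit in df2$Name, NA where there is no exact match
idx <- match(df1$Fruits, df2$Name)

# translate matched rows, leave the rest untouched
df1$Fruits <- ifelse(is.na(idx), df1$Fruits, df2$Rename[idx])
```
Because match() compares whole strings, "Grape Fruits" is not treated as "Grape" and stays as it is, which is what the expected output in the question shows.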
Using apply with an exact comparison against df2$Name can provide the desired output.
df1$Fruits <- apply(df1,1,function(x){
matched = (df2$Name == x)
if(any(matched)){
as.character(df2$Rename[matched])
} else {
x
}})
df1
# Fruits
# 1 Manzana
# 2 Grape Fruits
# 3 Limon
# 4 Melocoton
# 5 Banana
# 6 Naranja
# 7 Strawberry
# 8 Manzana
