Merging strings in R - r

Does anyone know a way to transform the dataset on the left into the one on the right, using an Excel macro or another programming tool?
Use R code to reshape the dataset on the left into the one on the right

This doesn't exactly match what you are looking for, but further data analysis might be easier with the data formatted this way:
library(tidyverse)
df = read_csv("Downloads/test.csv")
df = df %>% group_by(Transaction, Item) %>% summarize(count_value = n())
result = df %>% pivot_wider(names_from = "Item", values_from = "count_value", values_fill = 0)
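The same count-then-widen idea can be tried without the CSV file. This is only a sketch: the small data frame below is a made-up stand-in for test.csv (its Transaction and Item columns are assumed from the answer), and count() is used as a shorthand for group_by() plus summarize(n()).

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the contents of test.csv
df <- data.frame(Transaction = c(1, 2, 2, 2),
                 Item = c("Apple", "Banana", "Coconut", "Banana"))

result <- df %>%
  # one row per Transaction/Item pair with the number of occurrences
  count(Transaction, Item, name = "count_value") %>%
  # one column per item; pairs that never occur are filled with 0
  pivot_wider(names_from = Item,
              values_from = count_value,
              values_fill = 0)

result
```

Each row of result is a transaction and each item column holds how many times that item appears in it.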

I pivot data this way:
library(tidyverse)
df <- data.frame(Transaction=c(1,2,2),
Item=c("Apple", "Banana", "Coconut"))
df
df <- df %>%
group_by(Transaction) %>%
mutate(ItemNumber=paste0("Item", row_number()))
df
df <- df %>%
pivot_wider(names_from = ItemNumber,
values_from = Item)
df
# Transaction Item1 Item2
#1 1 Apple NA
#2 2 Banana Coconut
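If a single merged string per transaction is acceptable instead of separate Item1, Item2 columns, a shorter variant is to collapse the items with paste(). This is a sketch under that assumption; the Items column name and the ", " separator are arbitrary choices, not something from the question.

```r
library(dplyr)

df <- data.frame(Transaction = c(1, 2, 2),
                 Item = c("Apple", "Banana", "Coconut"))

merged <- df %>%
  group_by(Transaction) %>%
  # join all items of a transaction into one comma-separated string
  summarise(Items = paste(Item, collapse = ", "), .groups = "drop")

merged
```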

Related

R transform dataframe by parsing columns

Context
I have created a small sample dataframe to explain my problem. The original one is larger, as it has many more columns. But it is formatted in the same way.
df = data.frame(Case1.1.jpeg.text="the",
Case1.1.jpeg.text.1="big",
Case1.1.jpeg.text.2="DOG",
Case1.1.jpeg.text.3="10197",
Case1.2.png.text="framework",
Case1.3.jpg.text="BE",
Case1.3.jpg.text.1="THE",
Case1.3.jpg.text.2="Change",
Case1.3.jpg.text.3="YOUWANTTO",
Case1.3.jpg.text.4="SEE",
Case1.3.jpg.text.5="in",
Case1.3.jpg.text.6="theWORLD",
Case1.4.png.text="09.80.56.60.77")
The dataframe consists of output from a text detection ML model based on a certain number of input images.
The output format makes each word for each image a separate column, thereby creating a very wide dataset.
Desired Output
I am looking to create a cleaner version of it, with one column containing the image name (e.g. Case1.2.png) and the second with the concatenation of all possible words that the model finds in that particular image (the number of words varies from image to image).
result = data.frame(Case=c('Case1.1.jpeg','Case1.2.png','Case1.3.jpg','Case1.4.png'),
Text=c('thebigDOG10197','framework','BETHEChangeYOUWANTTOSEEintheWORLD','09.80.56.60.77'))
I have tried many approaches based on similar questions found on Stack Overflow, but none seem to give me the exact output I'm looking for.
Any help on this would be greatly appreciated.
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text.*)",
names_to = c("Case", NA)) %>%
group_by(Case) %>%
summarize(value = paste(value, collapse = ""), .groups = "drop")
Alternatively, this can be accomplished using just the pivot functions from tidyr:
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text).*",
names_to = c("Case", "cols")) %>%
pivot_wider(id_cols = Case,
values_from = value,
names_from = cols,
values_fn = str_flatten)
Output
Case value
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
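To see what the names_pattern regex in the first pipeline captures, the same pattern can be applied to a few column names with base R's sub(). This is only an illustration of the regex; pivot_longer does the equivalent split internally, and the three names below are taken from the sample data.

```r
# A few column names from the sample data frame
nms <- c("Case1.1.jpeg.text", "Case1.1.jpeg.text.1", "Case1.3.jpg.text.6")

pattern <- "(.*)\\.(text.*)"

# First capture group: everything before ".text" (the image name)
case <- sub(pattern, "\\1", nms)

# Second capture group: "text" plus any numeric suffix (discarded via NA in names_to)
suffix <- sub(pattern, "\\2", nms)

data.frame(name = nms, case = case, suffix = suffix)
```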
A possible solution:
library(tidyverse)
df %>%
pivot_longer(everything()) %>%
mutate(name = str_remove(name, "\\.text\\.*\\d*")) %>%
group_by(name) %>%
summarise(text = str_c(value, collapse = ""))
#> # A tibble: 4 x 2
#> name text
#> <chr> <chr>
#> 1 Case1.1.jpeg thebigDOG10197
#> 2 Case1.2.png framework
#> 3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
#> 4 Case1.4.png 09.80.56.60.77
An option in base R is to stack the data into a two-column data.frame with stack and then do a grouped paste with aggregate:
aggregate(cbind(Text = values) ~ Case, transform(stack(df),
Case = trimws(ind, whitespace = "\\.text.*")), FUN = paste, collapse = "")
Case Text
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
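The intermediate steps of the base R approach can be looked at one at a time. The sketch below uses a three-column subset of the sample data and swaps aggregate for tapply purely to keep the example short; it is not the answer's exact code.

```r
# Three columns from the sample data; stringsAsFactors = FALSE keeps the
# text as character so stack() does not drop the columns on older R versions
df <- data.frame(Case1.1.jpeg.text = "the",
                 Case1.1.jpeg.text.1 = "big",
                 Case1.2.png.text = "framework",
                 stringsAsFactors = FALSE)

# stack() returns two columns: values (the words) and ind (the column names)
long <- stack(df)

# Strip the ".text" suffix to recover the image name
long$Case <- sub("\\.text.*$", "", as.character(long$ind))

# Paste the words together within each image
res <- tapply(long$values, long$Case, paste, collapse = "")
res
```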
You can use pivot_longer(everything()), manipulate the "Case" column, group, and paste together:
pivot_longer(df,everything(),names_to="Case") %>%
mutate(Case = str_remove_all(Case, "\\.text.*")) %>%
group_by(Case) %>% summarize(Text=paste(value, collapse=""))
Output:
Case Text
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
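When stripping the suffix, the pattern is safest with the leading dot escaped, because an unescaped dot matches any character. The file name below is made up to show the difference; with the sample data in the question both patterns happen to give the same result.

```r
library(stringr)

# Hypothetical column name whose image name itself contains "text"
nm <- "context1.png.text"

# Unescaped dot: "." matches the "n" before the first "text", so too much is removed
unescaped <- str_remove(nm, ".text.*")

# Escaped dot: only a literal ".text" and what follows it is removed
escaped <- str_remove(nm, "\\.text.*")

c(unescaped = unescaped, escaped = escaped)
```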

What is the tidyverse way to apply a function designed to take data.frames as input across a grouped tibble in R?

I've written a function that takes multiple columns as its input that I'd like to apply to a grouped tibble, and I think that something with purrr::map might be the right approach, but I don't understand what the appropriate input is for the various map functions. Here's a dummy example:
myFun <- function(DF){
DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}
MyDF <- data.frame(A = 1:5, B = 6:10)
myFun(MyDF)
This works fine. But what if I want to add some grouping?
MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))
MyDF %>% group_by(Fruit) %>% summarize(MyVal = myFun(.))
This doesn't work. I get the same value for every group in my data.frame or tibble. I then tried using something with purrr:
MyDF %>% group_by(Fruit) %>% map(.f = myFun)
Apparently, that's expecting character data as input, so that's not it.
This next variation is basically what I need, but the output is a list of lists rather than a tibble with one row for each value of Fruit:
MyDF %>% group_by(Fruit) %>% group_map(~ myFun(.))
We can use the OP's function in group_modify:
library(dplyr)
MyDF %>%
group_by(Fruit) %>%
group_modify(~ .x %>%
summarise(MyVal = myFun(.x))) %>%
ungroup
Output:
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
Or use group_map, where .y holds the grouping key (a one-row tibble containing Fruit):
MyDF %>%
group_by(Fruit) %>%
group_map(~ bind_cols(.y, MyVal = myFun(.))) %>%
bind_rows
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
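Since the question asks about purrr::map, another possible route is to nest the data by group and map the function over the nested data frames. This is a sketch of one alternative, not a replacement for the approaches above. It also hints at why summarize(MyVal = myFun(.)) returns the same value for every group: inside the pipe, the dot stands for the whole piped-in data frame rather than the current group.

```r
library(dplyr)
library(tidyr)
library(purrr)

myFun <- function(DF) {
  DF %>% mutate(MyOut = A * B) %>% pull(MyOut) %>% sum()
}

MyDF <- data.frame(A = 1:100, B = 1:100,
                   Fruit = rep(c("Apple", "Mango"), each = 50))

res <- MyDF %>%
  # one row per Fruit, with the A and B columns packed into a list-column
  nest(data = c(A, B)) %>%
  # apply myFun to each nested data frame and keep a numeric result
  mutate(MyVal = map_dbl(data, myFun)) %>%
  select(Fruit, MyVal)

res
```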

Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
datetime=seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once a day for 3 straight days. I tried doing something like this:
dat$ymd <- ymd(dat$datetime[1]) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
drop_na(land) %>%
mutate(all.land = cumsum(land))
#flag days that have any land points
flag <- land.pts %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
slice(n()) %>%
mutate(flag = if_else(all.land == 0,0,1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x = comb %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0 | flag==0,
0,
difftime(datetime, lag(datetime), units="days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data to one row per id and date that records whether the animal was on land at least once that day.
Use zoo's rollapply function with a left-aligned window of width 3 to mark a day as TRUE when the animal was on land on that day and on the following two days.
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))
dat %>%
group_by(id, date) %>%
summarise(on_land = any(land == 1)) %>%
mutate(consec_three = rollapply(on_land, 3,all, align = 'left', fill = NA)) %>%
ungroup %>%
#If you want all the rows of the data
left_join(dat, by = c('id', 'date'))
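The rolling-window step can be checked on a small vector. This is a sketch with made-up daily values for one animal; it only illustrates how rollapply with align = "left" behaves and how the first qualifying day could be located.

```r
library(zoo)

# Hypothetical daily summary for one animal: was it on land that day?
on_land <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# For each day, look at that day and the next two; the last two days
# have no complete window and are filled with NA
consec_three <- rollapply(on_land, 3, all, align = "left", fill = NA)

# Index of the first day that starts a run of three land days (NA if none)
first_day <- match(TRUE, consec_three)

consec_three
first_day
```

Here only the first day starts a full three-day run, so first_day is 1.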

Mutate new column by comparing multiple column values in different data frame in r

I have a data frame DF, and for each Style value in the data frame DF1 I want to add a new column called Stage that records which of DF's columns (Col1, Col2, Col3, Col4, Col5, Col6) contains that value. Below is my sample data:
Col1=c("ABCD","","","","wxyz","")
Col2=c("","","MTNL","","","")
Col3=c("","PQRS","","","","")
Col4=c("","","","","","")
Col5=c("","","","","","")
Col6=c("","","","","","EFGH")
DF=data.frame(Col1,Col2,Col3,Col4,Col5,Col6)
Style=c("ABCD","WXYZ","PQRS","EFGH")
DF1=data.frame(Style)
Stage=c(1,1,3,6)
DFR=data.frame(Style,Stage)
DFR would be my resulting data frame.
Can someone help me solve this?
A tidyverse method:
library(tidyverse)
DFR <- DF %>%
mutate(across(everything(), ~na_if(., ""))) %>%
pivot_longer(cols = everything(),
names_to = "Stage",
values_to = "Style",
values_drop_na = TRUE) %>%
filter(Style %in% c("ABCD","WXYZ","PQRS","EFGH"))%>%
mutate(Stage = as.integer(gsub("Col", "", Stage)))
The first mutate call replaces your blank values with NA. Then I pivot your table to long format and drop the NA values, before filtering for only the Style values you're interested in. These could be saved in a vector to make the code cleaner, but the column and your vector have the same name, so I spelled them out to avoid confusion. The second mutate call is optional: it removes "Col" from each of your Stage values and converts the column to integer.
You can join the data after getting it into long format.
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(cols = everything()) %>%
right_join(DF1, by = c('value' = 'Style'))
# name value
# <chr> <chr>
#1 Col1 ABCD
#2 Col3 PQRS
#3 Col6 EFGH
#4 NA WXYZ
I solved this in the following way, and it works:
DF <- DF %>%
mutate(across(everything(), ~na_if(., "")))
DFR=DF1
DFR$Stage=ifelse(is.na(DF1$Style),NA,ifelse(DF1$Style %in% DF$Col1,1,
ifelse(DF1$Style %in% DF$Col2,2,
ifelse(DF1$Style %in% DF$Col3,3,
ifelse(DF1$Style %in% DF$Col4,4,
ifelse(DF1$Style %in% DF$Col5,5,
ifelse(DF1$Style %in% DF$Col6,6,NA)))))))
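The nested ifelse chain can also be written as a lookup over the columns, which avoids adding one more ifelse per column. This is a sketch in base R that assumes the same case-sensitive matching as the code above, so "WXYZ" does not match the lower-case "wxyz" in Col1 and comes out as NA.

```r
DF <- data.frame(Col1 = c("ABCD", "", "", "", "wxyz", ""),
                 Col2 = c("", "", "MTNL", "", "", ""),
                 Col3 = c("", "PQRS", "", "", "", ""),
                 Col4 = c("", "", "", "", "", ""),
                 Col5 = c("", "", "", "", "", ""),
                 Col6 = c("", "", "", "", "", "EFGH"))
Style <- c("ABCD", "WXYZ", "PQRS", "EFGH")

# For each style, find the position of the first column that contains it
Stage <- sapply(Style, function(s) {
  hit <- which(sapply(DF, function(col) s %in% col))
  if (length(hit) == 0) NA_integer_ else hit[[1]]
})

DFR <- data.frame(Style = Style, Stage = unname(Stage))
DFR
```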

Detect string with multiple conditions and mutate column in r

I have a lookup data frame (DF1 below) that serves as a dictionary of items and their types, and a Raw_file data frame whose Items column holds comma-separated combinations of items. I want to populate the types in a Result data frame based on the items in Raw_file.
Item1=c("Banana","Tomato","Potato","Palak")
Item2=c("","Orange","Onion","Mango")
Type1=c("Fruit","Vegetable","Vegetable","Leaves")
Type2=c("","Fruit","Vegetable","Fruit")
DF1=data.frame(Item1,Item2,Type1,Type2)
Items=c("Onion,Potato,Ginger","Tomato","Banana","Palak,Mango","Onion,Capsicum","Orange,Sweet_potato")
Raw_file=data.frame(Items)
Result_Type1=c("Vegetable","Vegetable","Fruit","Leaves","","")
Result_Type2=c("Vegetable","","","Fruit","Vegetable","Fruit")
Result=data.frame(Items,Result_Type1,Result_Type2)
My output data frame should look like Result.
I tried something with str_detect in a case statement but could not get it to work. Can someone help me with this?
Maybe you can do a join between these two tables (similar to your other question).
First, put DF1 in long format. For Raw_file, use separate_rows to get a single item in each row before the join.
library(tidyverse)
DF1_long <- DF1 %>%
pivot_longer(cols = everything(),
names_to = c(".value", "number"),
names_pattern = "(\\w+)(\\d+)$")
Raw_file %>%
mutate(value = Items) %>%
separate_rows(value, sep = ",") %>%
inner_join(DF1_long, by = c("value" = "Item")) %>%
group_by(Items) %>%
distinct(Items, number, .keep_all = TRUE) %>%
pivot_wider(id_cols = Items,
names_from = number,
values_from = Type,
names_prefix = "Result_Type")
Output
Items Result_Type2 Result_Type1
<chr> <chr> <chr>
1 Onion,Potato,Ginger Vegetable Vegetable
2 Tomato NA Vegetable
3 Banana NA Fruit
4 Palak,Mango Fruit Leaves
5 Onion,Capsicum Vegetable NA
6 Orange,Sweet_potato Fruit NA
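The ".value" entry in names_to is what pairs each Item column with its matching Type column. The sketch below runs that step on its own, using a two-row subset of DF1, to show the shape of the long lookup table before the join.

```r
library(tidyr)

# Two rows of the lookup table from the question
DF1 <- data.frame(Item1 = c("Banana", "Tomato"),
                  Item2 = c("", "Orange"),
                  Type1 = c("Fruit", "Vegetable"),
                  Type2 = c("", "Fruit"))

DF1_long <- pivot_longer(DF1,
                         cols = everything(),
                         # the name stem (Item/Type) becomes a column,
                         # the trailing digit goes into "number"
                         names_to = c(".value", "number"),
                         names_pattern = "(\\w+)(\\d+)$")

DF1_long
```

Each row of DF1_long is one item with its type and the number of the column pair it came from, which is the key used later to build Result_Type1 and Result_Type2.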
