Conversion of data (one column) with a stringr

I have a table called DATA_SET. This table contains one column with six different
cases of data.
#DATA_SET
DATA_SET<-data.frame(
CUSTOMS_RATE=c("20","15+0,41 eur/kg","10+0,1 eur/kg max.17",
"0,1 eur/l max.17","0,04 eur/kg max.10","NA")
)
View(DATA_SET)
#DATA_SET1
DATA_SET1<-data.frame(
RATE="",
SPECIFIC_RATE="",
MAXIMUM_RATE=""
)
My intention is to split this column into three separate columns, laid out like DATA_SET1 above, so that I can continue with other statistical operations (calculating averages, etc.).
Can anybody help me transform this table?

Usually, separate() would be the better option, but in this case the numbers are not in the same position in every row (and are sometimes missing). So we use str_extract() to pull out each value individually:
library(tidyverse)
DATA_SET %>%
mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", "."),
RATE = str_extract(CUSTOMS_RATE, "^[0-9]+(?=\\+|$)"),
SPECIFIC_RATE = str_extract(CUSTOMS_RATE, "\\d+\\.\\d+"),
MAXIMUM_RATE = str_extract(CUSTOMS_RATE, "(?<=max\\.)\\d+")) %>%
select(2:4) %>%
mutate_all(as.numeric)
# RATE SPECIFIC_RATE MAXIMUM_RATE
#1 20 <NA> <NA>
#2 15 0.41 <NA>
#3 10 0.1 17
#4 <NA> 0.1 17
#5 <NA> 0.04 10
#6 <NA> <NA> <NA>
Alternatively, use str_replace() to create a single delimiter and then use separate():
DATA_SET %>%
mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", ".") %>%
str_replace("\\+?([0-9]+\\.[0-9]+)", "+\\1") %>%
str_replace_all("[A-Za-z/ ]+\\.?", "+")) %>%
separate(CUSTOMS_RATE, into = c("RATE", "SPECIFIC_RATE", "MAXIMUM_RATE"),
sep="\\+", convert = TRUE)
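For readers without the tidyverse installed, the first approach can be sketched in base R as well. This is only an illustration under stated assumptions: extract_first() is a hypothetical helper written for this example (it is not a stringr or base function), and the regular expressions are the same ones used in the str_extract() calls above. The last line computes the column averages the question mentions.

```r
# extract_first() is a hypothetical helper, not part of any package.
# It returns the first match of `pattern` in each string, or NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # fill only the rows that matched
  out
}

rates <- c("20", "15+0,41 eur/kg", "10+0,1 eur/kg max.17",
           "0,1 eur/l max.17", "0,04 eur/kg max.10", "NA")
rates <- gsub(",", ".", rates, fixed = TRUE)  # decimal comma -> decimal point

result <- data.frame(
  RATE          = as.numeric(extract_first(rates, "^[0-9]+(?=\\+|$)")),
  SPECIFIC_RATE = as.numeric(extract_first(rates, "\\d+\\.\\d+")),
  MAXIMUM_RATE  = as.numeric(extract_first(rates, "(?<=max\\.)\\d+"))
)

# Column averages, ignoring the missing entries
colMeans(result, na.rm = TRUE)
```

perl = TRUE is needed because the lookahead and lookbehind in these patterns are not supported by R's default regex engine.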

Related

How to split one column with multiple delimiters into multiple columns in R?

I have values with the following structure: string OR string_string.integer
Example:
df<-data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
Objs
Windows
Door_XYZ.1
Door_XYY.1
Chair_XYYU.2
Using split(), separate(), or something similar, I need to generate a data frame like the one below.
Note: the split must be performed on the characters "_" and "."
Objs          IND    TAG   Control
Windows       NA     NA    NA
Door_XYZ.1    Door   XYZ   1
Door_XYY.1    Door   XYY   1
Chair_XYYU.2  Chair  XYYU  2
The closest solution was suggested by @Tommy in a similar context:
df %>% data.frame(.,do.call(rbind,str_split(.$Objs,"_")))

The default value of the sep argument in separate() will nearly get the result you need. A conditional mutate is also needed to remove the Windows entry from the IND column.
library(tidyverse)
df <- data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
df %>%
separate(Objs, into = c("IND", "TAG", "Control"), remove = FALSE, fill = "right") %>%
mutate(IND = if_else(Objs == IND, NA_character_, IND))
#> Objs IND TAG Control
#> 1 Windows <NA> <NA> <NA>
#> 2 Door_XYZ.1 Door XYZ 1
#> 3 Door_XYY.1 Door XYY 1
#> 4 Chair_XYYU.2 Chair XYYU 2
Created on 2022-05-05 by the reprex package (v1.0.0)
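A base R sketch of the same split, for comparison. It assumes, as in the example data, that every value either has no delimiter at all or has exactly the form string_string.integer; to_row() is a hypothetical helper introduced only for this illustration.

```r
objs <- c("Windows", "Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2")

# Split on either "_" or "."; entries without a delimiter yield one piece
parts <- strsplit(objs, "[_.]")

# Keep the pieces only when all three are present, otherwise use NA
to_row <- function(p) if (length(p) == 3) p else rep(NA_character_, 3)
mat <- t(vapply(parts, to_row, character(3)))

out <- data.frame(Objs    = objs,
                  IND     = mat[, 1],
                  TAG     = mat[, 2],
                  Control = as.integer(mat[, 3]),
                  stringsAsFactors = FALSE)
out
```

Because rows with fewer than three pieces are set to NA as a whole, no separate step is needed to blank out the "Windows" entry in IND.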

Randomly select 1 value from rows with multiple values; create a new column with cleaned data

This seems like a simple task, but I cannot find a good way to do it in R. I have a column, P, in which many rows contain multiple values:
P:
[340000, 410000]
[450000, 450000]
530000
110000
[330000, 440000]
510000
440000
620000
320000
Desired P1 (the values marked with * should be randomly selected):
340000*
450000*
530000
110000
440000*
510000
440000
620000
320000
I want to build a new column, P1, that randomly selects one value from every row of column P that starts with "[" and keeps the single values from the other rows unchanged. This is part of a larger effort to clean the column so it is usable for regression.
Right now, this tidyverse code is the best option I have come up with for the mutation:
foo <- data.frame(P=="[")
foo %>%
rowwise %>%
mutate(P1 = sample(P, 1))
But this isn't returning the output I need. Aside from sample(), I'm not sure what else can be used for random selection from a [] vector. What would be the best way to go about this? I appreciate the help.

You can remove the [] from the column values, split the data on commas so that each value gets its own row, and then select one random value for each original row.
library(dplyr)
df %>%
mutate(P1 = gsub('\\[|\\]', '', P),
row = row_number()) %>%
tidyr::separate_rows(P1, sep = ',\\s*') %>%
group_by(row) %>%
slice_sample(n = 1) %>%
#In older version of dplyr use sample_n
#sample_n(1)
ungroup %>%
select(-row)
# P P1
# <chr> <chr>
#1 [340000, 410000] 340000
#2 [450000, 450000] 450000
#3 530000 530000
#4 110000 110000
#5 [330000, 440000] 440000
#6 510000 510000
#7 440000 440000
#8 620000 620000
#9 320000 320000
In base R you can implement the same logic with
df$P1 <- sapply(strsplit(gsub('\\[|\\]', '', df$P), ',\\s*'), sample, 1)
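Since the cleaned column is meant for regression, a variant that makes the draw repeatable and returns numbers may be useful. The following is a sketch under assumptions: the seed value is arbitrary, pick_one() is a hypothetical helper, and P holds only a few of the question's values. Picking by index with sample.int() avoids the special case where sample(x, 1) on a single number x >= 1 draws from 1:x, which would matter if the values were converted to numeric before sampling.

```r
P <- c("[340000, 410000]", "[450000, 450000]", "530000", "[330000, 440000]")

set.seed(42)  # arbitrary seed, fixed only so the draw can be repeated

# Strip the brackets and split on commas: one character vector per row
candidates <- strsplit(gsub("\\[|\\]", "", P), ",\\s*")

# Pick one element per row by index, then convert to numeric
pick_one <- function(v) v[sample.int(length(v), 1)]
P1 <- as.numeric(vapply(candidates, pick_one, character(1)))
P1
```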
data
df <- structure(list(P = c("[340000, 410000]", "[450000, 450000]",
"530000", "110000", "[330000, 440000]", "510000", "440000", "620000",
"320000")), class = "data.frame", row.names = c(NA, -9L))

Melt Data and fill new column with desired data

Hello coding community,
I have a two-part question that is half answered:
1. Transpose (melt) the data frame to my liking: done.
2. Add rows of data based on the values found in the "removed" column, a column created in the transposing step: stuck here.
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment run over 8 days. On certain days I remove data points, and I am only interested in those days (which is why I added na.rm = TRUE in the transposing step). Sometimes I remove 1 data point, sometimes 4, but it could be any number.
I would like the removed data points to be called "individuals" and to be counted in chronological order. Therefore, I first need to add a column called "individual":
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the values in the "removed" column.
Example: cage 2 had only 1 data point removed, on day_8, so I would like a 1 in the "individual" column. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points, i.e. 4 "individuals". So for cage 4, day_5 should get a 1 in the "individual" column, and day_7 should become 3 rows of data that continue the count with 2, 3, 4. If day_8 had 3 more data points removed, the count would continue with 5, 6, 7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.

We can use complete() to create a sequence from 1 to the number of removed individuals within each cage and day. We then fill the NA values in the experiment and removed columns.
library(dplyr)
library(tidyr)
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To number the individuals consecutively within each cage (across days), we can do:
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
group_by(cage) %>%
mutate(individual = row_number()) %>%
fill(experiment, removed, .direction = "up")
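The same expansion can be sketched in base R with rep() and ave(). The data frame below is hypothetical: it only mirrors the question's worked example for cages 2 and 4 rather than the pastebin data, and it assumes the rows are already sorted by cage and day.

```r
# Hypothetical long-format data: one row per cage/day with a removal count
long <- data.frame(
  cage    = c(2, 4, 4),
  day     = c("day_8", "day_5", "day_7"),
  removed = c(1, 1, 3)
)

# Repeat each row as many times as its 'removed' count
expanded <- long[rep(seq_len(nrow(long)), times = long$removed), ]
rownames(expanded) <- NULL

# Running count of individuals within each cage, continuing across days
expanded$individual <- ave(seq_len(nrow(expanded)), expanded$cage,
                           FUN = seq_along)
expanded
```

The number of rows in expanded equals sum(long$removed), which is the check the question proposes for the full data set.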

I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Then, by grouping by cage and summing the removed column with summarize, you get how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different from your desired result. I am not 100% sure your desired result is actually correct. From your question, I understand that cage 4 should have 4 individuals, but in your desired_results it appears 4 times, with values 1, 2, 3 and 4. The code I sent you generates a data frame where each cage appears in a single row.

Exchange certain patterns in one column with NA in another column in R

In a data frame with multiple columns, I want to move values that match certain patterns into the column they belong to, leaving NA in the columns they came from.
I have the data below:
data = data.frame(adverb=c('truly','extremely','wanted','happily','stressed'),verb=c('loved','adored','a','prayed','the'),article=c('you','the','toy',NA,'importance'),argument=c(NA,'doll',NA,NA,NA))
adverb verb article argument
1 truly loved you <NA>
2 extremely adored the doll
3 wanted a toy <NA>
4 happily prayed <NA> <NA>
5 stressed the importance <NA>
I want to relocate values in the data to the corresponding columns according to the patterns below.
adverb.pattern = '[a-z]+ly$'
verb.pattern = '[a-z]+ed$'
article.pattern = '(the)$|(a)$|(an)$'
argument.pattern = '(you)$|(importance)$|(toy)$'
This is the desired output.
adverb verb article argument
1 truly loved <NA> you
2 extremely adored the doll
3 <NA> wanted a toy
4 happily prayed <NA> <NA>
5 <NA> stressed the importance

Here's a tidyverse solution:
# example data
data = data.frame(adverb=c('truly','extremely','wanted','happily','stressed'),
verb=c('loved','adored','a','prayed','the'),
article=c('you','the','toy',NA,'importance'),
argument=c(NA,'doll',NA,NA,NA),
stringsAsFactors = F)
library(tidyverse)
# specify patterns
adverb.pattern = '[a-z]+ly$'
verb.pattern = '[a-z]+ed$'
article.pattern = '(the)$|(a)$|(an)$'
argument.pattern = '(you)$|(importance)$|(toy)$'
data %>%
mutate(id = row_number()) %>% # add row id (useful to reshape)
gather(type, value, -id) %>% # reshape data
na.omit() %>% # remove rows with NAs
mutate(type_upd = case_when(grepl(adverb.pattern, value) ~ "adverb", # check patterns sequentially
grepl(verb.pattern, value) ~ "verb",
grepl(article.pattern, value) ~ "article",
grepl(argument.pattern, value) ~ "argument"),
type_upd = ifelse(is.na(type_upd), type, type_upd)) %>% # use original / initial type if updated type is NA
select(-type) %>% # remove old type
spread(type_upd, value) %>% # reshape data
select(adverb, verb, article, argument) # select column in this order
# adverb verb article argument
# 1 truly loved <NA> you
# 2 extremely adored the doll
# 3 <NA> wanted a toy
# 4 happily prayed <NA> <NA>
# 5 <NA> stressed the importance
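The sequential check that case_when() performs can also be looked at on its own. The sketch below uses a hypothetical helper, classify(), that returns the name of the first matching pattern for a single word; the patterns are the ones from the question.

```r
patterns <- c(adverb   = "[a-z]+ly$",
              verb     = "[a-z]+ed$",
              article  = "(the)$|(a)$|(an)$",
              argument = "(you)$|(importance)$|(toy)$")

# Return the name of the first pattern that matches, or NA when none does
classify <- function(word) {
  hit <- vapply(patterns, grepl, logical(1), x = word)
  if (any(hit)) names(patterns)[which(hit)[1]] else NA_character_
}

vapply(c("truly", "wanted", "the", "toy", "doll"), classify, character(1))
```

Note that the article and argument patterns are anchored only at the end of the string, so "(a)$" matches any word ending in "a". Adding a leading ^ (for example '^(the|a|an)$') would make the match exact if that matters for your data.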

Add synonyms from qdap to a preexisting dataframe in R

I have created the following dataframe df in R
Sl NO Word
1 get
2 Free
3 Joshi
4 Hello
5 New
I have used this code to get the synonyms, but the result is returned as a list:
library(qdap)
synonyms(DF$Word)
This gives me a list of synonyms. I want the synonyms for each word in the dataframe to be appended row-wise to the dataframe as separate columns.
DF<-
Sl NO Word Syn1 Syn2
1 get obtain receive
2 Free independent NA
3 Joshi NA NA
4 Hello Greeting NA
5 New Unused Fresh
Is there an elegant way to obtain this? Are there other dictionaries that can be used for this?

One approach is to use mapply to pass one word at a time to qdap::synonyms. The synonyms returned for each word can be collapsed into a single string using paste0 with collapse = "|".
tidyr::separate then splits that string into separate synonym columns.
Note: synonyms is called with the arguments return.list = FALSE and multiwords = FALSE.
The code below is limited to a maximum of 10 synonyms, but the solution can be extended to handle the number dynamically.
library(tidyverse)
library(qdap)
df %>%
mutate(Synonyms =
mapply(function(x)paste0(
head(synonyms(x, return.list = FALSE, multiwords = FALSE),10), collapse = "|"),
tolower(.$Word))) %>%
separate(Synonyms, paste("Syn",1:10), sep = "\\|", extra = "drop" )
Result:
# SlNO Word Syn 1 Syn 2 Syn 3 Syn 4 Syn 5 Syn 6 Syn 7 Syn 8 Syn 9 Syn 10
# 1 1 get achieve acquire attain bag bring earn fetch gain glean inherit
# 2 2 Free buckshee complimentary gratis gratuitous unpaid footloose independent liberated loose uncommitted
# 3 3 Joshi <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 4 4 Hello <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 5 New advanced all-singing all-dancing contemporary current different fresh ground-breaking happening latest
Data
df <- read.table(text =
"SlNO Word
1 get
2 Free
3 Joshi
4 Hello
5 New",
header = TRUE, stringsAsFactors = FALSE)
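One way the fixed limit of 10 could be made dynamic is to pad every synonym vector to the length of the longest one before binding them into columns. The sketch below shows only that padding step in base R. syn_list is a hypothetical stand-in for per-word synonym results, not real qdap output, and the sketch assumes at least one word has a synonym.

```r
# Hypothetical stand-in for per-word synonym vectors (not real qdap output)
syn_list <- list(get   = c("obtain", "receive"),
                 free  = "independent",
                 joshi = character(0))

max_len <- max(lengths(syn_list))  # number of Syn columns needed

# Pad each vector with NA up to max_len, then bind the vectors as rows
padded <- do.call(rbind, lapply(syn_list, function(s) {
  c(s, rep(NA_character_, max_len - length(s)))
}))
colnames(padded) <- paste0("Syn", seq_len(max_len))
rownames(padded) <- NULL

syn_df <- data.frame(Word = names(syn_list), padded,
                     stringsAsFactors = FALSE)
syn_df
```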

Here is another approach with splitstackshape::cSplit.
library(tidyverse)
library(qdap)
library(splitstackshape)
# tt is assumed to hold the question's table as a text string (it is not defined here)
DF <- read.table(text = tt, header = TRUE)
DF <- DF %>% mutate(Word = tolower(Word))
syns <- synonyms_frame(synonyms(tolower(DF$Word))) %>%
mutate(x = str_remove(x, "\\..*"), # plain mutate() replaces the deprecated funs()
y = str_extract(y, '[:alpha:]+')) %>%
group_by(x) %>%
summarise(Syn = toString(y)) %>%
rename(Word = x) %>% cSplit('Syn')
left_join(DF, syns)

I am not sure exactly how you would like to add all the synonyms of a word: when you run synonyms("get") it returns 75 definitions of "get", and I feel the desired layout would not be much help if the values from all 75 definitions were added to a single row.
So in the solution below I have selected only the first definition.
library(qdap)
library(dplyr)
library(splitstackshape)
df %>%
rowwise() %>%
mutate(synonym_of_word = paste(synonyms(tolower(word))[[1]], collapse=",")) %>%
cSplit("synonym_of_word", ",")
Sample data:
df <- structure(list(sl_no = 1:5, word = c("get", "Free", "Joshi",
"Hello", "New")), .Names = c("sl_no", "word"), class = "data.frame", row.names = c(NA,
-5L))
