In R, how can I split a long character string?

In R, I have a long character string "mr", shown below. How can I split "mr" at the numbers, so that it becomes three short strings?
mr <- 'total amount 25.36 expense -2 promotion discount-2.56'
# 'total amount 25.36','expense -2','promotion discount-2.56'

An option with tidyverse
library(dplyr)
library(tidyr)
tibble(col1 = mr) %>%
separate_rows(col1, sep="(?<=\\d) ") %>%
separate(col1, into = c("Description", "Amount"),
sep = "(?<=[a-z])\\s*(?=[-0-9])", convert = TRUE)
# A tibble: 3 x 2
#   Description        Amount
#   <chr>               <dbl>
# 1 total amount        25.4
# 2 expense             -2
# 3 promotion discount  -2.56

Adding to @rawr's comment: if you want the result as a data frame,
mr <- 'total amount 25.36 expense -2 promotion discount-2.56'
splt <- strsplit(mr, '(?<=\\d) ', perl = TRUE)[[1]]
df <- data.frame("Description" = trimws(gsub("[^a-z ]", "", splt)),
"Amount" = as.numeric(gsub("[^0-9.-]", "", splt)))
df
         Description Amount
1       total amount  25.36
2            expense  -2.00
3 promotion discount  -2.56
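If only the three substrings from the question are needed, the strsplit() call is enough by itself. A minimal sketch (the name parts is only illustrative): the lookbehind (?<=\\d) makes the split happen at a space only when a digit sits right before it, so the spaces inside each description are left alone.

```r
mr <- 'total amount 25.36 expense -2 promotion discount-2.56'

# perl = TRUE is needed because lookbehind is a PCRE feature
parts <- strsplit(mr, '(?<=\\d) ', perl = TRUE)[[1]]
print(parts)
```

Note that the last piece keeps "discount-2.56" glued together, because the input has no space there.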

Related

How to pivot_wider the n unique values of variable A grouped_by variable B?

I am trying to pivot_wider() the column X of a data frame, which contains various persons' names. Within each group of another variable Y of the df there are always 2 of these names. I would like R to take the 2 unique X values within each unique identifier of Y and put them in 2 new columns, ex_X_Name_1 and ex_X_Name_2.
My data frame is looking like this:
df <- data.frame(Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich", "John Rates",
"Eva Luates", "Fred Notebien", "Aldous Grading", "Hans Streng", "Anna Filaktic"),
Rating = format(round(x = sqrt(sample(15:95, 10, replace = TRUE)), digits = 3), nsmall = 3)
)
df
I would like to turn the Referee column into 2 new columns, Referee_1 and Referee_2, holding the 2 unique referees assigned to each student, and end up with this result:
# TRUE for rows 1, 3, 5, ... (the first referee of each student)
odd_row <- as.logical(seq_along(df$Referee) %% 2)
df_wanted <- tibble::tibble(
Student = unique(df$Student),
Referee_1 = df$Referee[odd_row],
Rating_Ref_1 = df$Rating[odd_row],
Referee_2 = df$Referee[!odd_row],
Rating_Ref_2 = df$Rating[!odd_row]
)
df_wanted
I guess I could achieve this by subsetting the unique rows of student/referee combinations and joining them, but is there a way to handle this in one call to pivot_wider?
You should create a row id per group first:
library(dplyr)
library(tidyr)
df %>%
group_by(Student) %>%
mutate(row_n = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = "row_n", values_from = c("Referee", "Rating"))
# A tibble: 5 × 5
  Student Referee_1     Referee_2      Rating_1 Rating_2
    <dbl> <chr>         <chr>          <chr>    <chr>
1   17383 Paul Severe   Cathy Nice     9.165    7.810
2   16487 Jean Exigeant Hilda Ehrlich  5.196    6.557
3   17646 John Rates    Eva Luates     7.211    5.568
4    2648 Fred Notebien Aldous Grading 4.000    8.124
5    3785 Hans Streng   Anna Filaktic  7.937    6.325
Using data.table:
library(data.table)
setDT(df)
merge(df[, .SD[1], Student], df[, .SD[2], Student], by = "Student", suffixes = c("_1", "_2"))
#    Student     Referee_1 Rating_1      Referee_2 Rating_2
# 1:    2648 Fred Notebien    6.708 Aldous Grading    9.747
# 2:    3785   Hans Streng    6.245  Anna Filaktic    8.775
# 3:   16487 Jean Exigeant    7.681  Hilda Ehrlich    4.359
# 4:   17383   Paul Severe    4.583     Cathy Nice    7.616
# 5:   17646    John Rates    6.708     Eva Luates    8.246
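The same "row id per group" idea can also be tried without extra packages. The following is only a sketch using base R's reshape(); the helper column idx is a made-up name, and the ratings are copied from the printed pivot_wider output instead of being sampled, so the example is reproducible.

```r
df <- data.frame(
  Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
  Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich",
              "John Rates", "Eva Luates", "Fred Notebien", "Aldous Grading",
              "Hans Streng", "Anna Filaktic"),
  Rating = c("9.165", "7.810", "5.196", "6.557", "7.211",
             "5.568", "4.000", "8.124", "7.937", "6.325"),
  stringsAsFactors = FALSE
)

# running index within each Student: 1 for the first referee, 2 for the second
df$idx <- ave(seq_along(df$Student), df$Student, FUN = seq_along)

# long -> wide; every non-id column gets the suffix _1 or _2
wide <- reshape(df, idvar = "Student", timevar = "idx",
                direction = "wide", sep = "_")
print(wide)
```

The row names of wide are inherited from the first row of each student; rownames(wide) <- NULL resets them if that matters.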

Cleaning Data: Multiple Misspelled Strings in R

I have over 100 strings that I want to change. For example, Scheduled Caste, Schdeduled Caste and Schedulded Caste all need to be changed to SC.
I have been doing it like this:
Haryana3$Category[Haryana3$Category %in% "Scheduled Caste"] <- "SC"
Is there a more efficient way to do this?
Use gsub:
Haryana3$Category <- gsub("Scheduled Caste", "SC", Haryana3$Category)
You can use data.table and try the following:
library(data.table)
setDT(Haryana3)
Haryana3[, Category := gsub("Scheduled Caste", "SC", Category)]
I guess the rule is to combine the first letter of each word. If that is true, here is one idea.
library(tidyverse)
Haryana3 <- Haryana3 %>%
mutate(Category = strsplit(Category, split = " ")) %>%
mutate(Category = map_chr(Category, ~paste0(str_sub(.x, start = 1L, end = 1L), collapse = "")))
Haryana3
# ID Category
# 1 1 SC
# 2 2 SC
# 3 3 ST
# 4 4 ST
# 5 5 FC
DATA
Haryana3 <- read.table(text = "ID Category
1 'Scheduled Caste'
2 'Scheduled Caste'
3 'Scheduled Tribes'
4 'Scheduled Tribes'
5 'Forward Caste'", header = TRUE)
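The gsub() calls above replace one exact spelling at a time, so each misspelling would still need its own line. One possible way around that is to map every value to the closest correct spelling by edit distance. This is a sketch only: the lookup vector canonical and the sample vector x are assumptions made up for illustration, and adist() from the utils package does the distance calculation.

```r
# correct spellings (names) and their target codes (values); illustrative only
canonical <- c("Scheduled Caste"  = "SC",
               "Scheduled Tribes" = "ST",
               "Forward Caste"    = "FC")

x <- c("Scheduled Caste", "Schdeduled Caste", "Schedulded Caste",
       "Scheduled Tribes", "Forward Caste")

# edit-distance matrix: one row per value in x, one column per correct spelling
d <- adist(x, names(canonical))

# index of the closest correct spelling for every value
nearest <- apply(d, 1, which.min)

recoded <- unname(canonical[nearest])
print(recoded)
```

On real data it is probably wise to inspect the minimum distance per row as well (for example apply(d, 1, min)) and leave values unchanged when nothing is close, because which.min always picks some category.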

How to tidy a data set whose columns contain several pieces of information (sample data included)?

Please help me make my data tidy. Thanks.
There are 394 observations in total, with 26 columns. The data was exported from MS Excel.
A data sample is given below. This sample should really contain only three observations (rows).
In the vectors d1..d2..no and Farmer.Name, the values on rows where v1 is NA should be removed from those rows and attached to the preceding row.
d1..d2..no holds three pieces of information (two dates and one unique identification number), and so does Farmer.Name.
The sample is:
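# v1 was missing from the posted sample; reconstructed from the description (NA on the continuation rows)
v1<-c(1,NA,NA,2,NA,NA,3,NA,NA)
d1..d2..no<-c("27/01/2020", "43832", "KE004421", "43832", "43832",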
"KE003443", "31/12/2019", "43832", "KE0001512")
Farmer.Name<-c("S Jacob Gender:male","farmer type :marginal","farmer category :general",
"J Isac Gender :Female","farmer type: large","farmer category :general",
"P Kumar Gender :Male","farmer type:small","farmer category :general")
adress<-c("k11",NA,NA,"k12",NA,NA,"k13",NA,NA)
amount<-c(25,NA,NA,25,NA,NA,32,NA,NA)
mydata<-data.frame(v1=v1, d1..d2..no=d1..d2..no, Farmer.Name=Farmer.Name,
adress=adress, amount=amount)
That is, my expected result is what this code produces:
v1<-c(1,2,3)
d1<-c("27/01/2020","43832","31/12/2019")
d2<-c("43832","43832","43832")
no<-c("KE004421","KE003443","KE0001512")
Farmer.Name1<-c("S Jacob","J Isac","P Kumar")
Gender<-c("male","female","male")
farmer_type <-c("marginal","large","small")
farmer_category <-c("general", "general", "general")
adress<-c("k11","k12","k13")
amount<-c(25,25,32)
myfinaldata<-data.frame(v1=v1,d1=d1,d2=d2,no=no,
Farmer.Name1=Farmer.Name1,
farmer_type=farmer_type,
farmer_category=farmer_category,
adress=adress,amount=amount)
The result should be
  v1         d1    d2        no Farmer.Name1 farmer_type farmer_category adress amount
1  1 27/01/2020 43832  KE004421      S Jacob    marginal         general    k11     25
2  2      43832 43832  KE003443       J Isac       large         general    k12     25
3  3 31/12/2019 43832 KE0001512      P Kumar       small         general    k13     32
I am a novice at programming and R, learning through online resources. This is also my first post on this platform, so please forgive any mistakes.
I have tried a lot with spread, separate, etc. from the tidyverse, but I am stuck on how to proceed.
Untidy data can be a challenge. Here is a tidyverse approach.
First, add a column (d_var) holding the target names d1, d2 and no for each value of d1..d2..no. This assumes the rows always come in that order.
Column Farmer.Name is separated into two columns at the colon.
The name itself is split off before the word Gender.
fill() carries the values shared by one individual (v1, adress, amount and Name) down to that individual's other rows.
pivot_wider() is then applied twice: once to spread d1, d2 and no, and once for the remaining variables (Gender, farmer_type and farmer_category); the two results are joined.
library(tidyverse)
df1 <- mydata %>%
mutate(d_var = rep(c("d1", "d2", "no"), times = 3)) %>%
separate(Farmer.Name, into = c("Var", "Val"), sep = ":") %>%
separate(Var, into = c("Name", "Var"), sep = "(?=Gender)", fill = "left") %>%
mutate_at(c("Name", "Var"), trimws) %>%
fill(v1, adress, amount, Name, .direction = "down") %>%
mutate(Var = gsub(" ", "_", Var))
df1 %>%
pivot_wider(id_cols = c(v1, Name, adress, amount), names_from = d_var, values_from = d1..d2..no) %>%
left_join(pivot_wider(df1, id_cols = c(v1, Name, adress, amount), names_from = Var, values_from = Val))
Output
# A tibble: 3 x 10
v1 Name adress amount d1 d2 no Gender farmer_type farmer_category
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 S Jacob k11 25 27/01/2020 43832 KE004421 male "marginal" general
2 2 J Isac k12 25 43832 43832 KE003443 Female " large" general
3 3 P Kumar k13 32 31/12/2019 43832 KE0001512 Male "small" general
The dates in your data set are not in date format. Consider formatting them after this.
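As a rough illustration of that formatting step: values such as "43832" look like Excel serial day numbers, which would fit the data having been exported from MS Excel, but that is an assumption about this data set. Under that assumption (and assuming the Windows 1900 date system, whose origin in R is "1899-12-30"), the mixed column could be converted like this; raw, is_serial and dates are illustrative names.

```r
raw <- c("27/01/2020", "43832", "31/12/2019")

# purely numeric entries are treated as Excel serial day numbers (assumption)
is_serial <- grepl("^[0-9]+$", raw)

dates <- rep(as.Date(NA), length(raw))
dates[is_serial]  <- as.Date(as.numeric(raw[is_serial]), origin = "1899-12-30")
dates[!is_serial] <- as.Date(raw[!is_serial], format = "%d/%m/%Y")

print(dates)
```

Here 43832 comes out as 2020-01-02; if the workbook used the 1904 date system, the origin would have to be changed accordingly.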
Another option, using base R row indexing and stringr:
df.new <- cbind(mydata[seq(1, nrow(mydata), 3), ], mydata[seq(2, nrow(mydata), 3), ][2:3], mydata[seq(3, nrow(mydata), 3), ][2:3])
colnames(df.new) <- c("v1", "d1", "Farmer.Name1", "adress", "amount", "d2", "farmer_type", "no", "farmer_category")
df.new <- df.new[c(1,2,6, 8,3, 7,9, 4,5)]
library(stringr)
df.new$Farmer.Name1 <- word(df.new$Farmer.Name1,1,sep = "\\ Gender")
df.new$farmer_type <- word(df.new$farmer_type,2,sep = "\\:")
df.new$farmer_category <- word(df.new$farmer_category,2,sep = "\\:")
Final table:
> df.new
  v1         d1    d2        no Farmer.Name1 farmer_type farmer_category adress amount
1  1 27/01/2020 43832  KE004421      S Jacob    marginal         general    k11     25
4  2      43832 43832  KE003443       J Isac       large         general    k12     25
7  3 31/12/2019 43832 KE0001512      P Kumar       small         general    k13     32
P.S.: I have not renamed the row numbers.

Is there an R function to split a sentence into word pairs?

I have a couple of unstructured sentences like the ones below. Description is the column name.
Description
Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only
I want to split each sentence into columns Col1 to Col5 and count the occurrences, like below:
Col1             Col2           Col3             Col4
Automatic_lever  lever_for      for_a            a_machine
Vaccum_chamber   chamber_with   with_additional  additional_spare
Glove_box        box_for        for_R&D          R&D
The_Mini         Mini_Guage     Guage_5          5_sets
Vacuum_chamber   chamber_only   only
Automatic_lever  lever_only     only
Also, from the columns above, can I get the number of occurrences of these word pairs? For example, Vaccum_chamber and Automatic_lever are repeated twice here. What are the counts for the other pairs?
Here is a tidyverse option
library(tidyverse)
df %>%
rowid_to_column("row") %>%
mutate(words = map(str_split(Description, " "), function(x) {
# pad with NA so that the last word also gets a (one-word) window
x <- c(x, NA)
idx <- seq_len(length(x) - 1)
map_chr(idx, function(i) paste0(x[i:(i + 1)], collapse = "_"))
})) %>%
unnest(words) %>%
group_by(row) %>%
mutate(
# the padded window ends in "_NA"; strip it to keep the last word alone
words = str_replace(words, "_NA$", ""),
col = paste0("Col", 1:n())) %>%
ungroup() %>%
spread(col, words, fill = "")
## A tibble: 6 x 6
## Groups: row [6]
# row Description Col1 Col2 Col3 Col4
# <int> <fct> <chr> <chr> <chr> <chr>
#1 1 Automatic lever for a mac… Automatic_… lever_for for_a a_machine
#2 2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3 3 Glove box for R&D Glove_box box_for for_R&D R&D
#4 4 The Mini Guage 5 sets The_Mini Mini_Guage Guage_5 5_sets
#5 5 Vacuum chamber only Vacuum_cha… chamber_o… only ""
#6 6 Automatic lever only Automatic_… lever_only only ""
Explanation: We split the sentences in Description on a single space " ", then join each word to the one that follows it with a sliding window; the rest is just a long-to-wide transformation.
Not pretty, but it reproduces your expected output; instead of the manual sliding window approach you could also use zoo::rollapply.
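The second part of the question, how often each word pair occurs, can be handled separately from the wide layout. A minimal base R sketch, assuming the six sentences from the question; desc, bigrams and counts are illustrative names:

```r
desc <- c("Automatic lever for a machine",
          "Vaccum chamber with additional spare",
          "Glove box for R&D",
          "The Mini Guage 5 sets",
          "Vacuum chamber only",
          "Automatic lever only")

# for every sentence, join each word to the word that follows it
bigrams <- unlist(lapply(strsplit(desc, " ", fixed = TRUE), function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
}))

# frequency of each pair, most frequent first
counts <- sort(table(bigrams), decreasing = TRUE)
print(counts)
```

With this input Automatic_lever is counted twice, while Vaccum_chamber and Vacuum_chamber are counted once each because the two spellings differ; they would have to be harmonised first to be counted together.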
Sample data
df <- read.table(text =
"Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = T)

R data splitting unicodes

I have some data that I want to split into columns:
price_list <- c("Vegetables", " Garlic Desi<U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC> 140 per kg ",
" Fresh-bean<U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646> — per kg ",
"Fruits",
" Apple Kala Kolu Irani<U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+0648><U+0644><U+0648> <U+0627><U+06CC><U+0631><U+0627><U+0646><U+06CC> 168 per kg ",
" Apple golden 115 per kg ",
" Banana (I)<U+06A9><U+06CC><U+0644><U+0627> <U+0627><U+0646><U+0688><U+06CC><U+0646> 182 per dozen ",
"Others",
" Chicken<U+0645><U+0631><U+063A><U+06CC> <U+0634><U+06CC><U+0648><U+0631> 170 per kg ",
" Egg<U+0627><U+0646><U+0688><U+06D2> <U+0634><U+06CC><U+0648><U+0631> 95 per dozen "
)
I tried the following, but the Unicode escapes are causing problems:
library(stringr)
regexp <- "[[:digit:]]+"
rprice <- str_extract(df$price_list, regexp)
df$price <- data.frame(rprice)
The desired output looks like this:
Name Unicode Price Quantity
Vegetables
Fresh-bean فراشبین NA kg
Fruits
Apple golden NA 115 kg
Others
Egg انڈے شیور NA dozen
This forum is really helpful and has saved me hundreds of hours, thanks.
url <- "https://ictadministration.gov.pk/services/price-list/"
Complete code:
library(rvest)
scraping_wiki <- read_html("https://ictadministration.gov.pk/services/price-list/")
library(magrittr)
price_date <- scraping_wiki %>%
html_nodes(".tm-article-content > ol:nth-child(1) > div:nth-child(1)") %>%
html_text()%>%
strsplit(split = "\n") %>%
unlist() %>%
.[. != ""]
price_date <- gsub(":", "", price_date)
price_list <- scraping_wiki %>%
html_nodes(".xl-tbl") %>%
html_text() %>%
strsplit(split = "\n") %>%
unlist() %>%
.[. != ""]
Wow, messy. This gets you close:
library(dplyr)
library(stringr)
unis <- price_list %>% str_extract(pattern = "<[[:print:]]*>")
words <- price_list %>% str_extract(pattern = "[A-Z a-z<]*") %>% gsub("<U", "", x = .)
price <- price_list %>% str_extract(pattern = "[0-9]* per") %>% gsub("per", "", x = .)
quant <- price_list %>% str_extract(pattern = "per [a-z]*")
df <- tibble(Name = words, Unicode = unis, Price = price, Quantity = quant)
Result:
> head(df)
# A tibble: 6 x 4
Name Unicode Price Quantity
<chr> <chr> <chr> <chr>
1 Vegetables NA NA NA
2 " Garlic Desi" <U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC> "140~ per kg
3 " Fresh" <U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646> " " per kg
4 Fruits NA NA NA
5 " Apple Kala Kolu Irani" <U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+~ "168~ per kg
6 " Apple golden " NA "115~ per kg
I'm not a regex genius, so I'm sure there must be a cleaner way.
Here's a functional approach. It's always good to learn to work around a problem with your own functions.
The steps are as follows:
1. Clean the price_list and keep the name, number and quantity.
2. Write functions that extract each of those parts.
3. Apply the functions to the new data frame.
library(data.table)
library(stringr)
# clean text
clean_list <- lapply(price_list, function(i) gsub("<[^>]+>", "",i))
clean_list <- lapply(clean_list, function(i) gsub('per','',i))
clean_list <- lapply(clean_list, str_trim)
# convert list to data frame
df <- data.table(do.call('rbind', clean_list))
colnames(df) <- 'text'
# helper functions
get_number <- function(j)
{
p1 <- unlist(strsplit(j, ' '))
p2 <- grepl('\\d+',p1)
if(sum(as.integer(p2)) ==1) return (grep('\\d+',p1,value = T))
else return (0)
}
get_quantity <- function(j)
{
p1 <- unlist(strsplit(j, ' '))
p2 <- grepl('kg|dozen',p1)
if(sum(as.integer(p2)) ==1) return (grep('kg|dozen',p1,value = T))
else return (NA)
}
# apply functions and get output
df[,Name := sapply(text, function(i) unlist(strsplit(i, ' '))[1])]
df[,Price := sapply(text, get_number)]
df[,Quantity := sapply(text, get_quantity)]
df[,Unicode := sapply(price_list, function(x) str_extract(string = x, pattern = '<[[:print:]]*>'))]
head(df)
text Name Price Quantity Unicode
1 Vegetables Vegetables 0 NA NA
2 Garlic Desi 140 kg Garlic Desi 140 kg <U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC>
3 Fresh-bean — kg Fresh-bean 0 kg <U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646>
4 Fruits Fruits 0 NA NA
5 Apple Kala Kolu Irani 168 kg Apple Kala Kolu Irani 168 kg <U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+0648><U+0644><…
6 Apple golden 115 kg Apple golden 115 kg NA
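Both answers leave the Unicode column as literal <U+XXXX> escapes, while the desired output shows the actual characters. One possible way to convert them is sketched below; decode_u is a hypothetical helper name, and the sketch assumes the escapes are plain text in the strings, as in the posted price_list.

```r
# Replace every literal "<U+XXXX>" escape in s with the character it encodes
decode_u <- function(s) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tokens) {
    # drop the leading "<U+" and the trailing ">" to keep only the hex digits
    hex <- substr(tokens, 4, nchar(tokens) - 1)
    vapply(strtoi(hex, 16L), intToUtf8, character(1))
  })
  s
}

decoded <- decode_u(" Egg<U+0627><U+0646><U+0688><U+06D2> 95 per dozen ")
print(decoded)
```

Strings without escapes pass through unchanged, so the helper can be applied to the whole price_list vector before or after the other columns are extracted.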