Cleaning Data: Multiple Misspelled Strings in R

Cleaning Data: Multiple Misspelled Strings in R - r

I have over 100 strings that I want to change, for ex:
Scheduled Caste, Schdeduled Caste, Schedulded Caste need to be changed to SC.
I have been doing it like this: Haryana3$Category[Haryana3$Category%in% "Scheduled Caste"] <- "SC"
Is there anything I can do that's more efficient?

Use gsub
Haryana3$Category <- gsub("Scheduled Caste", "SC", Haryana3$Category)
You can use data.table and try the following:
library(data.table)
setDT(Haryana3)
Haryana3[, Catergory:= gsub("Scheduled Caste", "SC", Category)]

I guess the rule is combing all the first letter from each word. If that is true, here is one idea.
library(tidyverse)
Haryana3 <- Haryana3 %>%
mutate(Category = strsplit(Category, split = " ")) %>%
mutate(Category = map_chr(Category, ~paste0(str_sub(.x, start = 1L, end = 1L), collapse = "")))
Haryana3
# ID Category
# 1 1 SC
# 2 2 SC
# 3 3 ST
# 4 4 ST
# 5 5 FC
DATA
Haryana3 <- read.table(text = "ID Category
1 'Scheduled Caste'
2 'Scheduled Caste'
3 'Scheduled Tribes'
4 'Scheduled Tribes'
5 'Forward Caste'", header = TRUE)

Related

R: Pulling data from one column to create new columns

I have data with sample names that need to be unpacked and created into new columns.
sample
P10.1
P11.2
S1.1
S3.3
Using the sample ID data, I need to make three new columns: tissue, plant, stage.
sample tissue plant stage
P10.1 P 10 1
P11.2 P 11 2
S1.1 S 1 1
S3.3 S 3 3
Is there a way to pull the data from the sample column to populate the three new columns?

using dplyr and tidyr.
First we insert a "." in the sample code, next we separate sample into 3 columns.
library(dplyr)
library(tidyr)
df %>%
mutate(sample = paste0(substring(df$sample, 1, 1), ".", substring(df$sample, 2))) %>%
separate(sample, into = c("tissue", "plant", "stage"), remove = FALSE)
sample tissue plant stage
1 P.10.1 P 10 1
2 P.11.2 P 11 2
3 S.1.1 S 1 1
4 S.3.3 S 3 3
data:
df <- structure(list(sample = c("P10.1", "P11.2", "S1.1", "S3.3")),
.Names = "sample",
class = "data.frame",
row.names = c(NA, -4L))

Similar to #phiver, but uses regular expressions.
Within pattern:
The first parentheses captures any single uppercase letter (for tissue)
The second parentheses captures any one or two digit number (for plant)
The third parentheses captures any one or two digit number (for stage)
The sub() function pulls out those capturing groups, and places then in new variables.
library(magrittr)
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
df %>%
dplyr::mutate(
tissue = sub(pattern, "\\1", sample),
plant = as.integer(sub(pattern, "\\2", sample)),
stage = as.integer(sub(pattern, "\\3", sample))
)
Result (displayed with str()):
'data.frame': 4 obs. of 4 variables:
$ sample: chr "P10.1" "P11.2" "S1.1" "S3.3"
$ tissue: chr "P" "P" "S" "S"
$ plant : int 10 11 1 3
$ stage : int 1 2 1 3

This is similar to phiver's answer, but use separate twice. Notice that we can specify the position index in the sep argument.
library(tidyr)
dat2 <- dat %>%
separate(sample, into = c("tissue", "number"), sep = 1, remove = FALSE) %>%
separate(number, into = c("plant", "stage"), sep = "\\.", remove = TRUE, convert = TRUE)
dat2
# sample tissue plant stage
# 1 P10.1 P 10 1
# 2 P11.2 P 11 2
# 3 S1.1 S 1 1
# 4 S3.3 S 3 3
DATA
dat <- read.table(text = "sample
P10.1
P11.2
S1.1
S3.3",
header = TRUE, stringsAsFactors = FALSE)

How to add value base on specific character ,also fix with certain digits in R

There is the basic width : xxxx.xxxxxx (4digits before "." 6 digits after".")
Have to add "0" when each side before and after "." is not enough digits.
Use regexr find "[.]" location with combination of str_pad can
fix the first 4 digits but
don't know how to add value after the specific character with fixed digits.
(cannot find a library can count the location from somewhere specified)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am beginner of coding that sometime can't understand some regex like "["
e.t.c .With some explain of them would be super helpful.
Also i have a combination like this :
df$Category<-ifelse(regexpr("[.]",df$Category)==4,
paste("0",df1$Category,sep = ""),df$Category)
df$Category<-str_pad(df$Category,11,side = c("right"),pad="0")
Desire to know are there is any better way do this , especially count and
return the location from the END until specific character appear.

Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formating doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
Input df seems to be character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")

We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))

There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use #akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to breakdown the problem into its component parts. One transparent and easy-to-follow, in my opinion, approach would be to utilize the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result

Use separate in tidyr to separate Category on decimal. Use str_pad from stringr to add zeros in the front or back and paste them together.
library(tidyr) # to separate columns on decimal
library(dplyr) # to mutate and pipes
library(stringr) # to strpad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)

Checking if keyword in one table is within a string in another table using R

I've been trying to solve this issue with mapply, but I believe I will have to use several nested applies to make this work, and it has gotten real confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
> #Dataframe1 df1
>> keyword category
>> cat A
>> dog A
>> pig A
>> crow B
>> pigeon B
>> hawk B
>> catfish C
>> carp C
>> ...
>>
> #Dataframe2 df2
>> description A B C ....
>> false cat 1 0 0 ....
>> smiling pig 1 0 0 ....
>> shady pigeon 0 1 0 ....
>> dogged dog 2 0 0 ....
>> sad catfish 0 0 1 ....
>> hawkward carp 0 1 1 ....
>> ....
I tried to use mapply to get this to work but it fails, giving me the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't proceeded beyond this stage, i.e. attempting to get category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone be of help? I thank you very much. I am new to R so it would also help if someone could mention some 'thumb rules' for turning loops into apply functions. I cannot afford to use loops to solve this as it would take way too much time.

There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
outDf <- data.frame(description = description,
value = vapply(stringr::str_extract_all(description, df1$keyword),
length, numeric(1)), category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf<- do.call('rbind', outList) %>%
group_by(description, category) %>%
dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know if this is what you want, though.

What you are looking for is a so-called document-term-matrix (or dtm in short), which stems from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it would outperform the other solutions here by a large magnitude) especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B

Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to make solve this problem fast and without a lot of memory:
Use vectorized operations. These can be performed a lot quicker than use for loops. Note that lapply, mapply or vapply are just shorthand for for loops. I parallelize (see next) over the keywords such that the vectorization can be over the descriptions which is the largest dimension.
Use parallelization. Optimally using your multiple cores speeds up the proces at the cost of an increase in memory (since every core needs its own copy).
Example:
keywords <- stringi::stri_rand_strings(400, 2)
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)
keyword_occurance <- function(word, list_of_descriptions) {
description_keywords <- str_detect(list_of_descriptions, word)
}
category_occurance <- function(category, mat) {
rowSums(mat[,keyword_categories == category])
}
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
With my computer this takes 140 seconds and 14GB RAM to match 400 keywords in 15 categories to 3 million descriptions.

R - how to loop through a dataframe to match multiple substrings - concatenate all matches in a new column

I am quite new to R - have worked on this all day but am out of ideas.
I have a dataframe with long descriptions in one column, eg:
df:
ID Name Description
1 A ABC DEF
2 B ARS XUY
3 C ASD
And I have a vector of search terms:
ABC
ARS
XUY
DE
I would like to go through each row in the dataframe and search the Description for any of the search terms. I then want all matches to be concatenated in a new column in the dataframe, e.g.:
ID Name Description Matches
1 A ABC DEF ABC
2 B ARS XUY ARS;XUY
3 C ASD
I would want to search ~100k rows with 1000 search terms.
Does anyone have any ideas? I was able to get a matrix with sapply and grepl, but I'd rather have a concatenated solution.

One option using strsplit and %in% instead of regex:
df$Matches <- sapply(strsplit(as.character(df$Description), '\\s'),
function(x){paste(search[search %in% x], collapse = ';')})
df
# ID Name Description Matches
# 1 1 A ABC DEF ABC
# 2 2 B ARS XUY ARS;XUY
# 3 3 C ASD
data:
search <- c("ABC", "ARS", "XUY", "DE")
df <- structure(list(ID = 1:3, Name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), Description = structure(1:3, .Label = c("ABC DEF",
"ARS XUY", "ASD"), class = "factor"), Matches = c("ABC", "ARS;XUY",
"")), .Names = c("ID", "Name", "Description", "Matches"), row.names = c(NA,
-3L), class = "data.frame")

Another option, which I tried to use in the comments, is to use the stringr package. There are two potential downsides to this approach: 1) it uses regex, and 2) it returns the search term matched instead of the value found.
library(stringr)
df = data.frame(Name=LETTERS[1:3],
Description=c("ABC DEF", "ARS XUY", "ASD"),
stringsAsFactors=F)
search_terms = c("ABC", "ARS", "XUY", "DE")
regex = paste(search_terms, collapse="|")
df$Matches = sapply(str_extract_all(df$Description, regex), function(x) paste(x, collapse=";"))
df
# Name Description Matches
# (chr) (chr) (chr)
# 1 A ABC DEF ABC;DE
# 2 B ARS XUY ARS;XUY
# 3 C ASD
With that being said, I think Alistaire's solution is the better approach since it doesn't use regex.

Here's an alternative:
df <- data.frame(ID=c(1L,2L,3L),Name=c('A','B','C'),Description=c('ABC DEF','ARS XUY','ASD'),stringsAsFactors=F);
st <- c('ABC','ARS','XUY','DE');
df$Matches <- apply(sapply(paste0('\\b',st,'\\b'),grepl,df$Description),1L,function(m) paste(collapse=';',st[m]));
df;
## ID Name Description Matches
## 1 1 A ABC DEF ABC
## 2 2 B ARS XUY ARS;XUY
## 3 3 C ASD

Duplicate Columns when using the merge statement

When I try to merge some codes to the code descriptions I get 2 duplicate columns. I start out with this:
Table Name: Test
ID State
1 5
2 2
3 5
and want to merge it with this:
Table Name: statecode
StateID State
5 Mass
2 NY
to make a table like this:
ID State
1 Mass
2 NY
3 Mass
However, I get a table like this:
ID State State
1 5 Mass
2 2 NY
3 5 Mass
I used the merge command like this:
test = merge(x = test, y = statecode, by.x = "State", by.y = "StateID", all.x = T)
Is there a better function other than merge to use in this case? Maybe one to just replace the state code with the state name?
Thank you very much for the help!

You do have to say which column you want to drop, but you can express it concisely using dplyr, for example.
Generating sample data based on yours (but correcting the column names):
test <- read.table(text =
"ID StateID
1 5
2 2
3 5", header = TRUE)
statecode <- read.table(text =
"
StateID State
5 Mass
2 NY", header = TRUE)
Using dplyr:
library(dplyr)
test %>% left_join(statecode, by = "StateID") %>% select(-StateID)
ID State
1 1 Mass
2 2 NY
3 3 Mass

Another way with base R:
Pmerge <- function(df1, df2) {
res <- suppressWarnings(merge(df1, df2, by.x = "State", by.y = "Code", all.x = T)[,-1])
newdf <- res[order(res$ID),]
row.names(newdf) <- 1:nrow(newdf)
newdf
}
Pmerge(Test, statecode)
ID State
1 1 Mass
2 2 NY
3 3 Mass

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Cleaning Data: Multiple Misspelled Strings in R - r

I have over 100 strings that I want to change, for ex: Scheduled Caste, Schdeduled Caste, Schedulded Caste need to be changed to SC. I have been doing it like this: Haryana3$Category[Haryana3$Category%in% "Scheduled Caste"] <- "SC" Is there anything I can do that's more efficient?

Use gsub Haryana3$Category <- gsub("Scheduled Caste", "SC", Haryana3$Category) You can use data.table and try the following: library(data.table) setDT(Haryana3) Haryana3[, Catergory:= gsub("Scheduled Caste", "SC", Category)]

Related

R: Pulling data from one column to create new columns

How to add value base on specific character ,also fix with certain digits in R

Checking if keyword in one table is within a string in another table using R

R - how to loop through a dataframe to match multiple substrings - concatenate all matches in a new column

Duplicate Columns when using the merge statement

Categories

Resources