Decoding to Chinese characters in R

I accidentally converted the columns of Chinese characters in a tab-delimited text file to escaped Unicode code points. The records now look like this:
<U+5ECA><U+574A><U+5E02>
How do I convert that to this?
廊坊市
You can recreate the first 6 lines of my data frame in R with this code:
structure(list(City_Code = c(110000L, 110000L, 110000L, 110000L, 110000L, 110000L),
Origin_City = c("<U+5ECA><U+574A><U+5E02>", "<U+4FDD><U+5B9A><U+5E02>", "<U+5929><U+6D25><U+5E02>",
"<U+5F20><U+5BB6><U+53E3><U+5E02>", "<U+627F><U+5FB7><U+5E02>", "<U+90AF><U+90F8><U+5E02>"),
Origin_Province = c("<U+6CB3><U+5317><U+7701>", "<U+6CB3><U+5317><U+7701>", "<U+5929><U+6D25><U+5E02>",
"<U+6CB3><U+5317><U+7701>", "<U+6CB3><U+5317><U+7701>", "<U+6CB3><U+5317><U+7701>"),
Destination_City = c("<U+5317><U+4EAC>", "<U+5317><U+4EAC>", "<U+5317><U+4EAC>",
"<U+5317><U+4EAC>", "<U+5317><U+4EAC>", "<U+5317><U+4EAC>"),
Percentage = c("28.08%", "6.86%", "5.70%", "3.38%", "3.05%", "2.76%"),
Date = c("2020-03-13", "2020-03-13", "2020-03-13", "2020-03-13", "2020-03-13", "2020-03-13")),
row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

This code will convert the string to the appropriate Chinese characters:
library(stringi)
string <- '<U+5ECA><U+574A><U+5E02>'
cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))
# Output: 廊坊市
Source: Convert unicode to readable characters in R
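Since the question is about whole columns rather than a single string, the same substitution can be applied column by column. A minimal sketch, assuming the stringi package is installed; decode_u is a hypothetical helper name, the two-row data frame is a cut-down stand-in for the one in the question, and the pattern is tightened to four hexadecimal digits:

```r
library(stringi)

# Hypothetical helper: rewrite "<U+XXXX>" as "\uXXXX", then unescape it
decode_u <- function(x) {
  stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x))
}

# Small stand-in for the data frame in the question
df <- data.frame(
  Origin_City      = c("<U+5ECA><U+574A><U+5E02>", "<U+4FDD><U+5B9A><U+5E02>"),
  Destination_City = c("<U+5317><U+4EAC>", "<U+5317><U+4EAC>"),
  Percentage       = c("28.08%", "6.86%")
)

# Decode only the columns that hold escaped text
cols <- c("Origin_City", "Destination_City")
df[cols] <- lapply(df[cols], decode_u)
print(df)
```

Columns that contain no <U+XXXX> sequences, such as Percentage, can be left out of cols; passing them through decode_u would leave them unchanged anyway as long as they contain no backslashes.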

Related

R stringr unexpected behavior with str_replace & str_pad. Bug or Layer-8 problem?

I am using R 4.1.3 and the stringr-package 1.4.0 and get some unexpected results from this code:
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = stringr::str_pad(string = "\\1", width = 3, side = "left", pad = "0"))
Expected: "005"; Result: "05".
All the parts generate the expected results:
(1) The padding
stringr::str_pad(string = "5", width = 3, side = "left", pad = "0")
Returns "005"
(2) The regex match
stringr::str_replace(string = "5", pattern = "([0-9]+)", replacement = "\\1")
Returns "5".
Only the combination of these two leads to unexpected behavior.
For clarification, I already have working code and several solutions to choose from to achieve what I want to do, e.g. using an anonymous function:
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = {\(x) stringr::str_pad(string = x, width = 3, side = "left", pad = "0")})
The intention of the post is to clarify why the code at the top does not work.
Thanks in advance for any helpful input.
Edit:
It seems that "\\1" refers to the content of the capture group, but the character length is determined from the literal string "\\1":
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = {\(x) as.character(nchar(x))})
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = as.character(nchar("\\1")))
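Returns "1" and "2". The second example always returns "2" as the replacement for the captured group, independent of its content.
The problem is that R evaluates the replacement argument before str_replace() ever sees it. Inside the inner call to str_pad(), "\\1" is not a reference to the first capture group but a literal two-character string (a backslash followed by the digit 1). Padding that two-character string to width 3 adds a single zero and yields "0\\1"; only then does str_replace() substitute the capture group, which produces "05". If you want to stay with a plain regex replacement, consider this workaround, which prepends three zeros and then strips the surplus ones: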
x <- c("5", "12", "345", "1234")
output <- sub("^0{1,3}(\\d{3,})", "\\1", paste0("000", x))
output
[1] "005" "012" "345" "1234"
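The evaluation order can be checked step by step. A minimal sketch, assuming stringr is installed; the variable names are made up for illustration:

```r
library(stringr)

# Step 1: the inner call runs first and only ever sees the literal string "\\1"
inner <- str_pad("\\1", width = 3, side = "left", pad = "0")
nchar("\\1")   # 2: a backslash and the digit 1
inner          # the same two characters with one zero in front

# Step 2: str_replace() receives that already padded string as the replacement
res_literal <- str_replace("5", "([0-9]+)", inner)

# Passing a function instead defers the padding until the match is known
res_function <- str_replace(
  c("5", "12"), "([0-9]+)",
  function(m) str_pad(m, width = 3, side = "left", pad = "0")
)
```

Here res_literal is "05", the result from the question, while res_function is c("005", "012") because str_pad() is applied to the matched text rather than to the placeholder.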

Dealing with data from the same user twice when moving through a loop in R

Let's say I have a dataframe that looks a bit like this:
df <- tribble(
~person_id, ~timestamp,
"1", "02:26:10.000000",
"1", "03:45:37.000000",
"2", "22:03:39.000000",
"3", "11:46:24.000000",
"4", "18:26:55.000000",
"5", "17:01:20.000000",
"5", "03:10:17.000000",
"6", "23:16:05.000000",
)
df
Now let's say I import individual .csv files that each match the person_id like so:
user_files <- list.files(pattern = "\\.csv$", path = here("data"),
full.names = TRUE)
user_files <- user_files[sub("\\.csv$", "", basename(user_files)) %in% df$person_id]
There will naturally be fewer .csv files than the length of df$person_id because persons "1" and "5" appear twice in df$person_id.
I would now like to run a for loop that runs a program on each csv file. HOWEVER, where the same person_id occurs more than once, I would like to re-run the loop using the same csv file (since it's the same person again but with a different timestamp, so it will yield different results).
This is what the loop looks like:
for (i in seq_along(user_files)) {
user_file <- read_csv(user_files[i])
# Run lots of analysis on the CSV file
}
Now I need something in the loop that says "if df$person_id occurs more than once, repeat the loop using the same CSV file". Thanks in advance for any assistance.
If the user_files are unique, match them against df$person_id and use the resulting index to subset user_files, so that each file appears once per matching row:
v1 <- sub("\\.csv$", "", basename(user_files))
user_files2 <- na.omit(user_files[match(df$person_id, v1)])
Now loop over user_files2.
Alternatively, a better approach is to join the file paths to the original dataset and loop over the user_files column of the joined data, which keeps each timestamp next to its file:
library(dplyr)
df1 <- inner_join(df, tibble(user_files,
person_id = sub("\\.csv$", "", basename(user_files))))
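Either way, the idea is to loop over the rows of df rather than over the files, so that a repeated person_id reuses the same file with a different timestamp. A minimal base R sketch of that loop; the file paths are hypothetical, and the analysis step is left as a comment:

```r
# Sample of the data from the question
df <- data.frame(
  person_id = c("1", "1", "2", "3"),
  timestamp = c("02:26:10.000000", "03:45:37.000000",
                "22:03:39.000000", "11:46:24.000000")
)

# Hypothetical file paths; person "3" has no file
user_files <- c("data/1.csv", "data/2.csv")

# Look up each row's file by person_id, then drop rows without a file
ids <- sub("\\.csv$", "", basename(user_files))
df$user_file <- user_files[match(df$person_id, ids)]
jobs <- df[!is.na(df$user_file), ]

processed <- character(0)
for (i in seq_len(nrow(jobs))) {
  # user_data <- readr::read_csv(jobs$user_file[i])
  # ... run the analysis for jobs$timestamp[i] here ...
  processed <- c(processed, paste(jobs$user_file[i], jobs$timestamp[i]))
}
processed
```

Because the loop index runs over rows, the file for person "1" is visited twice, once per timestamp, without any special handling for duplicates.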

Cannot remove rows from data.frame [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
I have a series of Excel files and I have been using this basic code to import my data for a very long time. I have not made any changes to the data or the code, but the data is no longer read properly. I read the files as follows:
apply_fun <- function(filename) {
data <- readxl::read_excel(filename, sheet = "Section Cut Forces - Design",
skip = 1, col_names = TRUE)
data <- data[-1,]
data <- subset(data, StepType == "Max")
data <- data[,c(1,2,6,10)]
data$id <- filename
return(data)
}
filenames <- list.files(pattern = "\\.xlsx", full.names = TRUE)
first <- lapply(filenames,apply_fun)
out <- do.call(rbind,first)
The first few rows of out look like:
structure(list(SectionCut = c("1", "1", "1", "1", "1", "2", "2",
"2", "2", "2"), OutputCase = c("Service (after losses)", "LL-1",
"LL-2", "LL-3", "LL-4", "Service (after losses)", "LL-1", "LL-2",
"LL-3", "LL-4"), V2 = c("11.522", "28.587", "42.246000000000002",
"44.212000000000003", "36.183", "9.8469999999999995", "23.989000000000001",
"37.408999999999999", "43.401000000000003", "40.450000000000003"
), M3 = c("299728.66100000002", "42863.517999999996", "63147.332999999999",
"69628.464000000007", "59196.74", "0", "27.942", "44.863999999999997",
"46.31", "36.204999999999998"), id = c("./100-12-S00.xlsx", "./100-12-S00.xlsx",
"./100-12-S00.xlsx", "./100-12-S00.xlsx", "./100-12-S00.xlsx",
"./100-12-S00.xlsx", "./100-12-S00.xlsx", "./100-12-S00.xlsx",
"./100-12-S00.xlsx", "./100-12-S00.xlsx")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I try to remove rows as:
out2 <- out[!grep("Service (after losses)", out$OutputCase),]
but the result is 0 observations.
I must say that this just started being an issue for me. I have been able to run this code successfully for months now and never had an issue.
Parentheses are special characters in a regular expression: they delimit a group, so the pattern "Service (after losses)" matches the text "Service after losses" rather than the literal parentheses. Use fixed = TRUE in grep() to match the string literally. Also, ! should be used with grepl() and - with grep() to remove rows.
out[-grep("Service (after losses)", out$OutputCase, fixed = TRUE),]
Apart from that, this looks like an exact match, so why use pattern matching at all? Try:
out[out$OutputCase != 'Service (after losses)', ]
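The difference between the two patterns, and the grepl() form of the fix, can be seen on a three-row stand-in for out. A minimal sketch; the values are abbreviated from the question and the variable names are made up for illustration:

```r
out <- data.frame(
  OutputCase = c("Service (after losses)", "LL-1", "LL-2"),
  V2         = c("11.522", "28.587", "42.246")
)

# As a regex, the parentheses form a group, so nothing matches
regex_hits <- grep("Service (after losses)", out$OutputCase)

# With fixed = TRUE the pattern is taken literally
drop <- grepl("Service (after losses)", out$OutputCase, fixed = TRUE)
out2 <- out[!drop, ]
out2
```

The grepl() form is also safe when nothing matches: out[!drop, ] then keeps every row, whereas negative indexing with an empty grep() result returns zero rows.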

Convert a column from string to Int in Dataframe

Imagine I have a dataframe
df=DataFrame(A=rand(5),B=["8", "9", "4", "3", "12"])
What I want to do is to convert column B to Int type, so I used
df[!,:B] = convert.(Int64,df[!,:B])
But I got warning:
'Cannot Convert an object of type string to an object of type Int64'
Could you please tell me why I was wrong?
What you are looking for is the parse function, broadcast over the elements in the column with dot notation:
using DataFrames
df = DataFrame(A = rand(5), B = ["8", "9", "4", "3", "12"])
# parse reads each string as an integer; the dot broadcasts over the column
df[!, :B] = parse.(Int64, df[!, :B])
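convert is only defined between types where the conversion is unambiguous and lossless, and no such method exists from String to Int64, because an arbitrary string is not a number. For the same reason, Int64.(df[!,:B]) does not work on a column of strings. parse is the function that interprets text as a number; if the column may contain non-numeric strings, tryparse.(Int64, df[!,:B]) returns nothing for those entries instead of throwing an error.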

how to return rows with a keyword within a string contained in a cell in r

I thought that this would be a simple one line of code, but the solution to my challenge is eluding me. I am betting that my limited experience with the domain of R programming might be the source.
Data Set
df <- structure(list(Key_MXZ = c(1731025L, 1731022L, 1731010L, 1730996L,
1722128L, 1722125L, 1722124L, 1722123L, 1722121L, 1722116L, 1722111L,
1722109L), Key_Event = c(1642965L, 1642962L, 1647418L, 1642936L,
1634904L, 1537090L, 1537090L, 1616520L, 1634897L, 1634892L, 1634887L,
1634885L), Number_Call = structure(c(11L, 9L, 10L, 12L, 1L, 3L,
2L, 4L, 5L, 6L, 8L, 7L), .Label = c("3004209178-2010-04468",
"3004209178-2010-04469", "3004209178-2010-04470", "3004209178-2010-04471",
"3004209178-2010-04472", "3004209178-2010-04475", "3004209178-2010-04477",
"3004209178-2010-04478", "3004209178-2010-04842", "3004209178-2010-04850",
"I wish to return this row with the header", "Maybe this row will work too"
), class = "factor")), .Names = c("Key_MXZ", "Key_Event", "Number_Call"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12"))
In the last column I have placed two strings among other data types that would be used to identify the rows for a new dataframe -- using the phrase "this row". The end result might look like:
Key_MXZ|Key_Event|Number_Call
1|1731025|1642965|I wish to return this row with the header
4|1730996|1642936|Maybe this row will work too
I have tried the following variations of code, and others not shown, with little success.
txt <- c("this row")
table1 <- df[grep(txt,df),]
table2 <- df[pmatch(txt,df),]
df[,3]<-is.logical(df[,3])
table3 <- subset(df,grep(txt,df[,3]))
Any ideas on this challenge?
Quite similar to DMT's answer. The code below uses the data.table approach, which is fast if you have millions of rows:
library(data.table)
setDT(df); setkey(df, Number_Call)
df[grep("this row", Number_Call, ignore.case = TRUE)]
Key_MXZ Key_Event Number_Call
1: 1731025 1642965 I wish to return this row with the header
2: 1730996 1642936 Maybe this row will work too
Here is an approach that uses qdap's Search function. It's a wrapper for agrep so it can do fuzzy matching and the degree of fuzziness can be set:
library(qdap)
Search(df, "this row", 3)
## Key_MXZ Key_Event Number_Call
## 1 1731025 1642965 I wish to return this row with the header
## 4 1730996 1642936 Maybe this row will work too
Go with:
df[grep("this row", df$Number_Call, fixed=TRUE),]
# Key_MXZ Key_Event Number_Call
#1 1731025 1642965 I wish to return this row with the header
#4 1730996 1642936 Maybe this row will work too
You just needed to reference the actual column you wanted grep to search.
fixed = TRUE treats the pattern as a literal string rather than a regular expression, and grep returns the indices of the elements in the vector that match. If your match is a bit more nuanced, you can drop fixed = TRUE and replace "this row" with a regular expression.
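To see why the column reference matters, here is a minimal base R sketch on a three-row stand-in for df (values taken from the question). grepl() is used instead of grep() purely for illustration; it returns a logical vector that can be used for row indexing in the same way:

```r
df <- data.frame(
  Key_MXZ     = c(1731025L, 1731022L, 1730996L),
  Key_Event   = c(1642965L, 1642962L, 1642936L),
  Number_Call = factor(c("I wish to return this row with the header",
                         "3004209178-2010-04842",
                         "Maybe this row will work too"))
)

# Searching the column gives one TRUE/FALSE per row
hits <- grepl("this row", df$Number_Call, fixed = TRUE)
result <- df[hits, ]
result
```

By contrast, grep(txt, df) coerces the whole data frame with as.character(), which yields one string per column, so any index it returns refers to a column rather than a row.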
