How to find the names of columns that have non-English values in R?

I have data, shown in the image, in which some columns contain non-English words. How can I find the names of those columns in R?
The data and the expected result are shown in the image.

First some reproducible data:
df <- data.frame(
Var1 = c("some", "data", "ß", "کابل"),
Var2 = c("کابل", "data", "کابل", "data"),
Var3 = c("some", "data", "more", "data"),
Var4 = c("some", "data", "more", "data")
)
df
The solution first collapses each column into a single string using paste0 and then deselects (-) the columns in which grepl finds non-ASCII characters (used here as a proxy for non-English characters):
df[, -which(grepl("[^ -~]", apply(df, 2, paste0, collapse = " ")))]
Var3 Var4
1 some some
2 data data
3 more more
4 data data
EDIT:
To get only the names, simply insert the whole statement into names:
names(df[, -which(grepl("[^ -~]", apply(df, 2, paste0, collapse = " ")))])
[1] "Var3" "Var4"
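If the goal is the names of the columns that do contain non-ASCII values, as the question asks, a per-column check may be more direct. The following is a minimal sketch, not the answer's method: the helper name `has_non_ascii` is made up for illustration, `perl = TRUE` is an added choice so that the range ` -~` is interpreted by code point, and the non-ASCII strings are written as `\u` escapes so the snippet does not depend on the file encoding.

```r
# Same data as above; "\u00df" is ß and "\u06a9\u0627\u0628\u0644" is کابل
df <- data.frame(
  Var1 = c("some", "data", "\u00df", "\u06a9\u0627\u0628\u0644"),
  Var2 = c("\u06a9\u0627\u0628\u0644", "data", "\u06a9\u0627\u0628\u0644", "data"),
  Var3 = c("some", "data", "more", "data"),
  Var4 = c("some", "data", "more", "data"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column: does any value contain a character
# outside the printable ASCII range (space to tilde)?
has_non_ascii <- vapply(
  df,
  function(col) any(grepl("[^ -~]", col, perl = TRUE)),
  logical(1)
)

names(df)[has_non_ascii]   # columns with non-ASCII values
names(df)[!has_non_ascii]  # columns with ASCII-only values
```

Logical indexing also avoids a pitfall of `-which(...)`: when no column matches, `which()` returns `integer(0)`, and `df[, -integer(0)]` selects zero columns instead of all of them.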

Base R:
lapply(df, function(x){
ifelse(grepl("\\#", x), x, gsub(paste0(c(letters, LETTERS), collapse = "|"), "", x))})
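Return the names of the columns in which at least one value still has characters left after the ASCII letters are removed:
names(df)[sapply(df, function(x) {
# values containing "#" are skipped, as in the lapply call above
any(!grepl("\\#", x) & gsub("[A-Za-z]", "", x) != "")
})]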

Related

How to replace empty cells of a particular column in a list with a character in R

I have a list with around 100 data frames and want to replace the empty cells in a particular column (named Event) of all the data frames in the list with a character. I first tried the following code,
lapply(my_list, function(x) replace(x,is.na(x),"No_Event"))
The above code replaces every NA with "No_Event". But I want to replace only the empty cells in one specific column. I am also not sure how to represent blank cells in the code; using " " doesn't work.
Then I tried,
lapply(my_list, function(x) transform(x, Event = chartr(" ", 'No_Event', Event)))
I understand that the above code replaces a particular letter with the specified character, but I am not sure how to replace the empty/blank cells of a specific column with a character string. I also tried some other code, which produced errors. Apologies if the question is very basic or the approach I followed is wrong.
Thanks
Here is a reproducible example (R Version 4.1.0)
library(tidyverse)
my_list <- list(
data.frame(a = 1:5, Event = c(6, "", "", 9, ""), c = 11:15),
data.frame(a = 1:5, Event = c("", "", "", 8, 9), c = 16:20)
)
lapply(my_list, FUN = \(x) {
x |> mutate(Event = case_when(Event == "" ~ "No event", TRUE ~ Event))
})
For earlier R versions:
lapply(my_list, FUN = function(x) {
x %>% mutate(Event = case_when(Event == "" ~ "No event", TRUE ~ Event))
})
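The same replacement can also be sketched in base R, without loading the tidyverse. This is an illustrative variant, not part of the answer above: the helper name `fill_event` is hypothetical, and treating NA and whitespace-only cells as blank is an added assumption.

```r
my_list <- list(
  data.frame(a = 1:5, Event = c(6, "", "", 9, ""), c = 11:15,
             stringsAsFactors = FALSE),
  data.frame(a = 1:5, Event = c("", "", "", 8, 9), c = 16:20,
             stringsAsFactors = FALSE)
)

# Replace blank cells in the Event column only (hypothetical helper).
# A cell counts as blank if it is NA, empty, or whitespace-only.
fill_event <- function(x, label = "No_Event") {
  blank <- is.na(x$Event) | trimws(x$Event) == ""
  x$Event[blank] <- label
  x
}

my_list <- lapply(my_list, fill_event)
my_list[[1]]$Event
```

Other columns are left untouched because the assignment indexes `x$Event` directly.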

How to open this text file properly in R?

So I have this line of code in a file:
{"id":53680,"title":"daytona1-usa"}
But when I try to open it in R using this:
df <- read.csv("file1.txt", strip.white = TRUE, sep = ":")
It produces columns like this:
Col1: X53680.title
Col2: daytona1.usa.url
What I want to do is open the file so that the columns are like this:
Col1: 53680
Col2: daytona1-usa
How can I do this in R?
Edit: The actual file I'm reading in is this:
{"id":53203,"title":"bbc-moment","url":"https:\/\/wow.bbc.com\/bbc-ids\/live\/enus\/211\/53203","type":"audio\/mpeg"},{"id":53204,"title":"shg-moment","url":"https:\/\/wow.shg.com\/shg-ids\/live\/enus\/212\/53204","type":"audio\/mpeg"},{"id":53205,"title":"was-zone","url":"https:\/\/wow.was.com\/was-ids\/live\/enus\/213\/53205","type":"audio\/mpeg"},{"id":53206,"title":"xx1-zone","url":"https:\/\/wow.xx1.com\/xx1-ids\/live\/enus\/214\/53206","type":"audio\/mpeg"},], WH.ge('zonemusicdiv-zonemusic'), {loop: true});
After reading it in, I remove the first column and then every 3rd and 4th column with this:
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th columns
i1 <- rep(seq(3, ncol(df), 4) , each = 2) + 0:1
df <- df[,-i1]
Thank you.
Edit 2:
Adding this fixed it:
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
If it is a single JSON in the file, then
jsonlite::read_json("file1.txt")
# $id
# [1] 53680
# $title
# [1] "daytona1-usa"
If it is instead NDJSON (Newline-Delimited json), then
jsonlite::stream_in(file("file1.txt"), verbose = FALSE)
# id title
# 1 53680 daytona1-usa
Although the answers above would have been correct if the data had been formatted properly, it seems they don't work for the data I have so what I ended up going with was this:
df <- read.csv("file1.txt", header = FALSE, sep = ":", dec = "-")
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th columns
i1 <- rep(seq(3, ncol(df), 4) , each = 2) + 0:1
df <- df[,-i1]
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
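Because the actual file is a run of JSON objects followed by JavaScript, another option is to pull the fields out with a regular expression instead of splitting on ":". The sketch below uses only base R and assumes that every record has "id" immediately followed by "title", as in the sample; the text is inlined here (first two records only), and for the real file it could be read with `paste(readLines("file1.txt", warn = FALSE), collapse = "")`.

```r
# First two records from the question, plus the trailing JavaScript
txt <- paste0(
  '{"id":53203,"title":"bbc-moment","url":"https:\\/\\/wow.bbc.com\\/bbc-ids\\/live\\/enus\\/211\\/53203","type":"audio\\/mpeg"},',
  '{"id":53204,"title":"shg-moment","url":"https:\\/\\/wow.shg.com\\/shg-ids\\/live\\/enus\\/212\\/53204","type":"audio\\/mpeg"},',
  "], WH.ge('zonemusicdiv-zonemusic'), {loop: true});"
)

# Capture the numeric id and the quoted title of each record
pattern <- '"id":([0-9]+),"title":"([^"]*)"'
hits <- regmatches(txt, gregexpr(pattern, txt))[[1]]
parts <- regmatches(hits, regexec(pattern, hits))

# Each element of parts is c(full match, id, title)
df <- data.frame(
  id = as.integer(vapply(parts, `[`, character(1), 2)),
  title = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
df
```

This keeps the hyphen in the titles and needs no column deletion afterwards, but it breaks if the field order in the file changes.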

Return the unique character elements that are mixed with numeric ones in a data.frame

Suppose I have a data.frame that consists of mainly numeric values but also mixed with some character elements.
Is there a way to extract the unique character elements throughout the data.frame?
A toy example along with the desired output is shown below.
DF <- data.frame(x = c(1:3, "*", "."), y = c("--", 4:6, "="), z = 1:5, w = rep("a", 5))
desired_output <- c("*", ".", "--", "=")
You could extract all the values in DF that consist only of punctuation characters using grep:
unique(grep('^[[:punct:]]+$', as.character(unlist(DF)), value = TRUE))
#[1] "*" "." "--" "="
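If the stray entries are not guaranteed to be punctuation, one alternative is to look for values that fail numeric conversion, but only in columns that contain at least one number, so that a purely character column such as w is ignored. This is a sketch under that assumption; the helper name `non_numeric` is made up for illustration, and real NA values in a mixed column would also be reported.

```r
DF <- data.frame(x = c(1:3, "*", "."), y = c("--", 4:6, "="),
                 z = 1:5, w = rep("a", 5))

# Return the values of a column that cannot be parsed as numbers,
# but only if the column has at least one numeric value (hypothetical helper)
non_numeric <- function(col) {
  col <- as.character(col)
  is_num <- !is.na(suppressWarnings(as.numeric(col)))
  if (any(is_num)) col[!is_num] else character(0)
}

result <- unique(unlist(lapply(DF, non_numeric), use.names = FALSE))
result
```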

Data frame output as a single line

I have a data frame with multiple columns and rows. I want to export it as a .txt file with all values on the same line (i.e. one row), with individual values separated by "," and the rows of the data frame separated by ":".
w<- c(1,5)
x<- c(2,6)
y<- c(3,7)
z<- c(4,8)
df<-data.frame(w,x,y,z)
the output would look like this
1,2,3,4:5,6,7,8:
We can combine the data row-wise using apply and paste the rows together with collapse = ":". Note that toString separates values with ", " (a comma followed by a space).
paste0(apply(df, 1, toString), collapse = ":")
#[1] "1, 2, 3, 4:5, 6, 7, 8"
If you want to write it to a file, use:
write.table(df, "df.csv", col.names = FALSE, row.names = FALSE, sep = ",", eol = ":")
If you want the output in R you can use do.call() and paste():
do.call(paste, c(df, sep = ",", collapse = ":"))
[1] "1,2,3,4:5,6,7,8"
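To get that string into a .txt file, as the question asks, the two steps can be combined. The following is a small sketch; the temporary file path is a stand-in for the real output file, and `writeLines` ends the line with a newline, not with a trailing ":".

```r
df <- data.frame(w = c(1, 5), x = c(2, 6), y = c(3, 7), z = c(4, 8))

# Join the values of each row with ",", then join the rows with ":"
rows <- do.call(paste, c(df, sep = ","))
one_line <- paste(rows, collapse = ":")

# Write the single line to a text file (temporary path used for illustration)
out_file <- tempfile(fileext = ".txt")
writeLines(one_line, out_file)
readLines(out_file)
```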
We can use str_c
library(stringr)
library(dplyr)
library(purrr)
df %>%
reduce(str_c, sep=",") %>%
str_c(collapse=":")
#[1] "1,2,3,4:5,6,7,8"

how to split fields after reading the file in R

I have a file with this format in each line:
f1,f2,f3,a1,a2,a3,...,an
Here, f1, f2, and f3 are fixed fields separated by commas, and the fourth field is the whole of a1,a2,...,an, where n can vary.
How can I read this into R and conveniently store those variable-length a1 to an?
Thank you.
My file looks like the following
3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...
It is not clear what you mean by "conveniently store". If you think a data frame will suit you, try this:
df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
Edit following @Ananda Mahto's comment.
From ?read.table:
"The number of data columns is determined by looking at the first five lines of input".
Thus, if the maximum number of columns with data occurs somewhere after the first five lines, the solution above will fail.
Example of failure
# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")
# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df
Solution
# This can be solved by first counting the maximum number of columns in the text file
n_col <- max(count.fields("df", sep = ","))
# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE,
col.names = paste0("f", seq_len(n_col)))
df
# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
df
#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind)
#
# Output
#
V1 V2 V3 V4
1 3 a -4 news, finance
2 2 b 1 politics
3 1 a 0
4 2 c 2 book, movie
A place to start:
dat <- readLines(file) ## file being your file
## split each line on commas before building the data frame
dat_split <- strsplit(dat, ",", fixed = TRUE)
df <- data.frame(
f1=sapply(dat_split, "[[", 1),
f2=sapply(dat_split, "[[", 2),
f3=sapply(dat_split, "[[", 3),
a=unlist( sapply(dat_split, function(x) {
if (length(x) <= 3) {
return(NA)
} else {
return(paste(x[4:length(x)], collapse=","))
}
}) )
)
and when you need to pull things out of a, you can do splitting as necessary.
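As a rough illustration of that last step, the collapsed field can be split back into its parts with strsplit. The sketch below inlines the sample lines instead of reading a file, and the variable names `a` and `tags` are only illustrative.

```r
dat <- c("3,a,-4,news,finance", "2,b,1,politics", "1,a,0", "2,c,2,book,movie")
dat_split <- strsplit(dat, ",", fixed = TRUE)

# Collapse everything after the third field into one string per line
a <- vapply(dat_split, function(x) {
  if (length(x) <= 3) NA_character_ else paste(x[-(1:3)], collapse = ",")
}, character(1))

# Later, split the stored string back into individual values;
# lines without extra fields stay NA
tags <- strsplit(a, ",", fixed = TRUE)
tags[[1]]
```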
