How to split fields after reading a file in R

I have a file with this format in each line:
f1,f2,f3,a1,a2,a3,...,an
Here, f1, f2, and f3 are fixed fields separated by commas, and the fourth field is the whole of a1,a2,...,an, where n can vary from line to line.
How can I read this into R and conveniently store those variable-length a1 to an?
Thank you.
My file looks like the following:
3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...

It is not clear what you mean by "conveniently store". If you think a data frame will suit you, try this:
df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
                 sep = ",", na.strings = "", header = FALSE, fill = TRUE)
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
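This yields a data frame in which shorter lines are padded with NA:
df
#   f1 f2 f3       a1      a2
# 1  3  a -4     news finance
# 2  2  b  1 politics    <NA>
# 3  1  a  0     <NA>    <NA>
# 4  2  c  2     book   movie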
Edit, following @Ananda Mahto's comment.
From ?read.table:
"The number of data columns is determined by looking at the first five lines of input".
Thus, if the maximum number of columns with data occurs somewhere after the first five lines, the solution above will fail.
Example of failure
# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")
# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
                 sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df
Solution
# This can be solved by first counting the maximum number of columns in the text file
# (n_cols rather than ncol, to avoid shadowing the ncol() function)
n_cols <- max(count.fields("df", sep = ","))
# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5
df <- read.table(file = "df",
                 sep = ",", na.strings = "", header = FALSE, fill = TRUE,
                 col.names = paste0("f", seq_len(n_cols)))
df
# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
df

#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc <- textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind)
#
# Output
#
  V1 V2 V3            V4
1  3  a -4 news, finance
2  2  b  1      politics
3  1  a  0
4  2  c  2   book, movie
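Note that V4 is a list column: each element holds that row's variable-length tail, so you can index into it directly.
df$V4[[1]]
# [1] "news"    "finance"
lengths(df$V4)
# [1] 2 1 0 2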

A place to start:
dat <- readLines(file)  ## file being your file
dat_split <- strsplit(dat, ",", fixed = TRUE)
df <- data.frame(
  f1 = sapply(dat_split, "[[", 1),
  f2 = sapply(dat_split, "[[", 2),
  f3 = sapply(dat_split, "[[", 3),
  a  = unlist(sapply(dat_split, function(x) {
    if (length(x) <= 3) {
      return(NA)
    } else {
      return(paste(x[4:length(x)], collapse = ","))
    }
  }))
)
and when you need to pull things out of a, you can do splitting as necessary.
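For instance, a minimal sketch of that later splitting (assuming df$a is character, the data.frame default since R 4.0):
# split the collapsed "a" column back into variable-length vectors
a_split <- strsplit(df$a, ",", fixed = TRUE)
a_split[[1]]
# [1] "news"    "finance"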

How to open this text file properly in R?

So I have this line of code in a file:
{"id":53680,"title":"daytona1-usa"}
But when I try to open it in R using this:
df <- read.csv("file1.txt", strip.white = TRUE, sep = ":")
It produces columns like this:
Col1: X53680.title
Col2: daytona1.usa.url
What I want to do is open the file so that the columns are like this:
Col1: 53680
Col2: daytona1-usa
How can I do this in R?
Edit: The actual file I'm reading in is this:
{"id":53203,"title":"bbc-moment","url":"https:\/\/wow.bbc.com\/bbc-ids\/live\/enus\/211\/53203","type":"audio\/mpeg"},{"id":53204,"title":"shg-moment","url":"https:\/\/wow.shg.com\/shg-ids\/live\/enus\/212\/53204","type":"audio\/mpeg"},{"id":53205,"title":"was-zone","url":"https:\/\/wow.was.com\/was-ids\/live\/enus\/213\/53205","type":"audio\/mpeg"},{"id":53206,"title":"xx1-zone","url":"https:\/\/wow.xx1.com\/xx1-ids\/live\/enus\/214\/53206","type":"audio\/mpeg"},], WH.ge('zonemusicdiv-zonemusic'), {loop: true});
After reading it in, I remove the first column and then every 3rd and 4th column with this:
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th column
i1 <- rep(seq(3, ncol(df), 4), each = 2) + 0:1
df <- df[,-i1]
Thank you.
Edit 2:
Adding this fixed it:
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
If it is a single JSON in the file, then
jsonlite::read_json("file1.txt")
# $id
# [1] 53680
# $title
# [1] "daytona1-usa"
If it is instead NDJSON (newline-delimited JSON), then
jsonlite::stream_in(file("file1.txt"), verbose = FALSE)
# id title
# 1 53680 daytona1-usa
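If the file is really the JavaScript fragment shown in the edit, one option is to repair it into valid JSON before parsing. A sketch, assuming the file holds exactly that one line:
raw <- readLines("file1.txt", warn = FALSE)
# wrap the objects in brackets and cut everything from the trailing ",]" onwards
json <- sub(",\\]\\s*,\\s*WH\\.ge.*$", "]", paste0("[", raw))
df <- jsonlite::fromJSON(json)
df[, c("id", "title")]
#      id      title
# 1 53203 bbc-moment
# 2 53204 shg-moment
# 3 53205   was-zone
# 4 53206   xx1-zone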
Although the answers above would have been correct if the data had been formatted properly, it seems they don't work for the data I have, so what I ended up going with was this:
df <- read.csv("file1.txt", header = FALSE, sep = ":", dec = "-")
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th column
i1 <- rep(seq(3, ncol(df), 4), each = 2) + 0:1
df <- df[,-i1]
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)

How to find names of columns that have non-English values in R?

I have data, shown in the image, in which some columns contain non-English words. How can I find those column names using R?
The data and expected result are shown in the image.
First some reproducible data:
df <- data.frame(
  Var1 = c("some", "data", "ß", "کابل"),
  Var2 = c("کابل", "data", "کابل", "data"),
  Var3 = c("some", "data", "more", "data"),
  Var4 = c("some", "data", "more", "data")
)
df
The solution first strings all columns together using paste0 and then deselects (-) those column strings in which grepl finds matches of non-ASCII characters (treated here as equivalent to non-English characters):
df[, -which(grepl("[^ -~]", apply(df, 2, paste0, collapse = " ")))]
Var3 Var4
1 some some
2 data data
3 more more
4 data data
EDIT:
To get only the names, simply insert the whole statement into names:
names(df[, -which(grepl("[^ -~]", apply(df, 2, paste0, collapse = " ")))])
[1] "Var3" "Var4"
Base R:
lapply(df, function(x) {
  ifelse(grepl("\\#", x), x, gsub(paste0(c(letters, LETTERS), collapse = "|"), "", x))
})
Return names:
names(df)[sapply(df, function(x) {
  ifelse(grepl("\\#", x), FALSE,
         any(gsub(paste0(c(letters, LETTERS), collapse = "|"), "", x) == ""))
})]

How to parse a text file that contains the column names at the beginning of the file?

My text file looks like the following
"
file1
cols=
col1
col2
# this is a comment
col3
data
a,b,c
d,e,f
"
As you can see, the data only starts after the data tag and the rows before that essentially tell me what the column names are. There could be some comments which means the number of rows before the data tag is variable.
How can I parse that in R? Possibly with some tidy tools?
Expected output is:
# A tibble: 2 x 3
col1 col2 col3
<chr> <chr> <chr>
1 a b c
2 d e f
Thanks!
Here is a base R way with scan(): strip.white = TRUE trims surrounding whitespace, comment.char = "#" drops lines starting with #, and blank lines are skipped by default (blank.lines.skip = TRUE).
text <- scan("test.txt", "", sep = "\n", strip.white = TRUE, comment.char = "#")
text
# [1] "file1" "cols=" "col1" "col2" "col3" "data" "a,b,c" "d,e,f"
ind1 <- which(text == "cols=")
ind2 <- which(text == "data")
df <- read.table(text = paste(text[-seq(ind2)], collapse = "\n"),
                 sep = ",", col.names = text[(ind1 + 1):(ind2 - 1)])
df
# col1 col2 col3
# 1 a b c
# 2 d e f
I saved your file as ex_text.txt on my machine, removing the start and end quotes. Here's a solution. I don't know how extendable this is, and it might not work for "weirder" data.
# initialize
possible_names <- c()
not_data <- TRUE # stop when we find "data"
n <- 20 # lines to check the txt file
while (not_data) {
  # read txt line by line
  possible_names <- readLines("ex_text.txt", n = n)
  not_data <- all(possible_names != "data")  # found "data"?
  n <- n + 20  # increment to read more lines if necessary
}
# where does data start?
data_start <- which(possible_names == "data")
# remove unnecessary text and find actual column names
possible_names <- possible_names[2:(data_start - 1)]
possible_names <- possible_names[possible_names != ""]  # remove any blank lines
col_names <- possible_names[!grepl("#.*", possible_names)]  # remove comments
# read data
read.delim("ex_text.txt",
skip = data_start,
sep = ",",
col.names = col_names,
header = FALSE)
# col1 col2 col3
# 1 a b c
# 2 d e f
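Since the question asks about tidy tools, here is a readr/stringr sketch under the same assumption (a test.txt file without the surrounding quotes):
library(readr)
library(stringr)
lines <- str_trim(read_lines("test.txt"))
lines <- lines[lines != "" & !str_starts(lines, "#")]  # drop blanks and comments
cols <- lines[(which(lines == "cols=") + 1):(which(lines == "data") - 1)]
rows <- lines[-seq_len(which(lines == "data"))]
read_csv(I(paste(rows, collapse = "\n")), col_names = cols, show_col_types = FALSE)
# # A tibble: 2 x 3
#   col1  col2  col3
#   <chr> <chr> <chr>
# 1 a     b     c
# 2 d     e     f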

Convert tab-delimited string to data frame

I have a long character string that looks like this, except that where I've shown double backslashes there is, in reality, only one backslash.
char.string <- "BAT\\tUSA\\t\\tmedium\\t0.8872\\t9\\tOff production\\tCal1|Cal2\\r\\nGNAT\\tCAN\\t\\small\\t0.3824\\t11\\tOff production\\tCal3|Cal8|Cal9\\r\\n"
I tried the following.
df <- data.frame(do.call(rbind, strsplit(char.string, "\t", fixed=TRUE)))
df <- ldply (df, data.frame)
The first returns a vector. The second returns thousands of rows and two columns, one consisting of sequential numbers and the second consisting of all the data.
I'm trying to achieve this:
item = c("BAT", "GNAT")
origin = c("USA", "CAN")
size = c("medium", "small")
lot = c("0.8872", "0.3824")
mfgr = c("9", "11")
stat = c("Off production", "Off production")
line = c("Cal1|Cal2", "Cal3|Cal8|Cal9")
df = data.frame(item, origin, size, lot, mfgr, stat, line)
df
item origin size lot mfgr stat line
1 BAT USA medium 0.8872 9 Off production Cal1|Cal2
2 GNAT CAN small 0.3824 11 Off production Cal3|Cal8|Cal9
read.table() should actually be just fine here, but you have two basic problems:
1. There are two typos:
   a. I'm assuming you don't want \\small, but rather small
   b. You have \\t\\tmedium where I think you want just \\tmedium
2. "\\t" is not the same as "\t"
Try this:
# Start with your original input
char.string <- "BAT\\tUSA\\t\\tmedium\\t0.8872\\t9\\tOff production\\tCal1|Cal2\\r\\nGNAT\\tCAN\\t\\small\\t0.3824\\t11\\tOff production\\tCal3|Cal8|Cal9\\r\\n"
# Eliminate the typos
char.string <- sub("\\\\s", "s", char.string)
char.string <- sub("\\\\t\\\\t", "\\\\t", char.string)
# Convert \\t, etc. to actual tabs and newlines
char.string <- gsub("\\\\t", "\t", char.string)
char.string <- gsub("\\\\r", "\r", char.string)
char.string <- gsub("\\\\n", "\n", char.string)
# Read the data into a dataframe
df <- read.table(text = char.string, sep = "\t")
# Add the colnames
colnames(df) <- c("item", "origin", "size", "lot", "mfgr", "stat", "line")
# And take a look at the result
df
item origin size lot mfgr stat line
1 BAT USA medium 0.8872 9 Off production Cal1|Cal2
2 GNAT CAN small 0.3824 11 Off production Cal3|Cal8|Cal9
I took some liberties with what I think are typos in your char.string.
library(tidyverse)
char.string <- "BAT\\tUSA\\tmedium\\t0.8872\\t9\\tOff production\\tCal1|Cal2\\r\\nGNAT\\tCAN\\tsmall\\t0.3824\\t11\\tOff production\\tCal3|Cal8|Cal9\\n"
lapply(
  str_split(gsub("\\\\n", "", char.string), "\\\\r")[[1]],
  function(x) {
    y <- str_split(x, "\\\\t")[[1]]
    data.frame(
      item = y[1],
      origin = y[2],
      size = y[3],
      lot = y[4],
      mfgr = y[5],
      stat = y[6],
      line = y[7],
      stringsAsFactors = FALSE
    )
  }
) %>%
  bind_rows()
item origin size lot mfgr stat line
1 BAT USA medium 0.8872 9 Off production Cal1|Cal2
2 GNAT CAN small 0.3824 11 Off production Cal3|Cal8|Cal9

R: Read text file in which new lines start after the nth observation

Here is an extract of my text file:
Assets
Notes
2017
2016
Cash
6
12,000,000
11,000,000
I would like to read this file into a data frame containing 4 columns. It should look something like this:
Assets Notes 2017 2016
Cash 6 12000000 11000000
I'm thinking of looping to start a new row every four observations, but that doesn't look like the most efficient way to read the file into R. Any suggestions?
1) base. Read the lines into a character vector L. In the Note at the bottom we show Lines reproducibly, but you could replace the line that reads it in with the commented-out line, changing the file name appropriately.
Next, remove the commas and reshape the vector into an n x 4 matrix m. Then collapse the rows into a string vector L2 and read that with read.table.
No packages are used.
# L <- readLines("myfile")
L <- readLines(textConnection(Lines))
m <- matrix(gsub(",", "", L), ncol = 4, byrow = TRUE)
L2 <- apply(m, 1, paste, collapse = " ")
read.table(text = L2, header = TRUE, check.names = FALSE, as.is = TRUE)
giving:
Assets Notes 2017 2016
1 Cash 6 12000000 11000000
2) dplyr/tidyr. Using L from (1), we create a two-column data frame of names (recycled) and contents, and then spread it out to wide form.
library(dplyr)
library(tidyr)
L %>%
  { data.frame(Name = factor(.[1:4], levels = .[1:4]),
               Contents = gsub(",", "", .[-(1:4)])) } %>%
spread(Name, Contents, convert = TRUE)
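Note that spread() is superseded in current tidyr; a pivot_wider() sketch with the same L (values stay character; use type.convert() if you need numbers):
L %>%
  { tibble(Name = .[1:4], Contents = gsub(",", "", .[-(1:4)])) } %>%
  pivot_wider(names_from = Name, values_from = Contents)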
Note
Lines <- "Assets
Notes
2017
2016
Cash
6
12,000,000
11,000,000"
A third option: put the lines in a one-column data frame, then reshape them into a four-column matrix by row.
data <- structure(list(V1 = c("Assets", "Notes", "2017", "2016", "Cash",
                              "6", "12,000,000", "11,000,000")),
                  .Names = "V1", class = "data.frame", row.names = c(NA, -8L))
data.frame(matrix(unlist(data), ncol = 4, byrow = TRUE))
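This yields (note the commas survive and the header row remains a data row):
#       X1    X2         X3         X4
# 1 Assets Notes       2017       2016
# 2   Cash     6 12,000,000 11,000,000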
