I have this line in a file:
{"id":53680,"title":"daytona1-usa"}
But when I try to open it in R using this:
df <- read.csv("file1.txt", strip.white = TRUE, sep = ":")
It produces columns like this:
Col1: X53680.title
Col2: daytona1.usa.url
What I want to do is open the file so that the columns are like this:
Col1: 53680
Col2: daytona1-usa
How can I do this in R?
Edit: The actual file I'm reading in is this:
{"id":53203,"title":"bbc-moment","url":"https:\/\/wow.bbc.com\/bbc-ids\/live\/enus\/211\/53203","type":"audio\/mpeg"},{"id":53204,"title":"shg-moment","url":"https:\/\/wow.shg.com\/shg-ids\/live\/enus\/212\/53204","type":"audio\/mpeg"},{"id":53205,"title":"was-zone","url":"https:\/\/wow.was.com\/was-ids\/live\/enus\/213\/53205","type":"audio\/mpeg"},{"id":53206,"title":"xx1-zone","url":"https:\/\/wow.xx1.com\/xx1-ids\/live\/enus\/214\/53206","type":"audio\/mpeg"},], WH.ge('zonemusicdiv-zonemusic'), {loop: true});
After reading it in, I remove the first column and then every 3rd and 4th column with this:
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th column
i1 <- rep(seq(3, ncol(df), 4), each = 2) + 0:1
df <- df[,-i1]
Thank you.
Edit 2:
Adding this fixed it:
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
If the file contains a single JSON object, then
jsonlite::read_json("file1.txt")
# $id
# [1] 53680
# $title
# [1] "daytona1-usa"
If it is instead NDJSON (newline-delimited JSON), then
jsonlite::stream_in(file("file1.txt"), verbose = FALSE)
# id title
# 1 53680 daytona1-usa
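If the file is actually the JavaScript snippet shown in the edit, neither call will parse it directly. One sketch (assuming each record is a flat {...} object with no nested braces) is to extract the objects with a regex and parse each one with jsonlite:

```r
library(jsonlite)

# Read the whole snippet as one string
txt <- paste(readLines("file1.txt", warn = FALSE), collapse = "")
# Pull out every flat {...} object, then keep only the JSON-like records
objs <- regmatches(txt, gregexpr("\\{[^{}]*\\}", txt))[[1]]
objs <- objs[grepl('"id"', objs, fixed = TRUE)]   # drops the trailing {loop: true}
df <- do.call(rbind, lapply(objs, function(o)
  as.data.frame(fromJSON(o), stringsAsFactors = FALSE)))
df[, c("id", "title")]
```

This sidesteps the colon-splitting entirely, so no gsub cleanup of ".title" or ",url" fragments is needed afterwards.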
Although the answers above would have been correct had the data been formatted properly, they don't work for the data I actually have, so this is what I ended up going with:
df <- read.csv("file1.txt", header = FALSE, sep = ":", dec = "-")
# Delete the first column
df <- df[-1]
# Delete every 3rd and 4th column
i1 <- rep(seq(3, ncol(df), 4), each = 2) + 0:1
df <- df[,-i1]
df[] <- lapply(df, gsub, pattern = ".title", replacement = "", fixed = TRUE)
df[] <- lapply(df, gsub, pattern = ",url", replacement = "", fixed = TRUE)
Related
I have a set of data frames with a large number of columns. The column names follow a pattern like so:
my.df <- data.frame(sentiment_brand1_1 = c(1,0,0,1),
                    sentiment_brand1_2 = c(0,1,1,0),
                    sentiment_brand2_1 = c(1,1,1,1),
                    sentiment_brand2_2 = c(0,0,0,0),
                    brand1_rating_1 = c(1,2,3,4),
                    brand2_rating_1 = c(4,3,2,1))
I'd like to programmatically rename the columns, moving the substrings "brand1" and "brand2" from the middle of the column name to the end, e.g.:
desired_colnames <- c("sentiment_1_brand1",
                      "sentiment_2_brand1",
                      "sentiment_1_brand2",
                      "sentiment_2_brand2",
                      "rating_1_brand1",
                      "rating_1_brand2")
Capture the substring groups and rearrange them in the replacement:
sub("(.*)_(brand1)(.*)", "\\1\\3_\\2", v1)
output
[1] "variable_1_brand1" "_stuff_1_brand1" "thing_brand1"
data
v1 <- c("variable_brand1_1", "_brand1_stuff_1", "_brand1thing")
## Input:
Test <- c("variable_brand1_1", "_brand1_stuff_1", "_brand1thing")
library("stringr")
paste(str_remove(Test, "_brand1"), "_brand1", sep = "")
## Output:
[1] "variable_1_brand1" "_stuff_1_brand1" "thing_brand1"
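Both answers work on the toy vector; for the question's full data frame, where the brand token can also sit at the start of a name, one base R sketch generalized over both brands:

```r
my.df <- data.frame(sentiment_brand1_1 = c(1,0,0,1),
                    sentiment_brand1_2 = c(0,1,1,0),
                    sentiment_brand2_1 = c(1,1,1,1),
                    sentiment_brand2_2 = c(0,0,0,0),
                    brand1_rating_1 = c(1,2,3,4),
                    brand2_rating_1 = c(4,3,2,1))
nm    <- names(my.df)
brand <- regmatches(nm, regexpr("brand[0-9]+", nm))   # extract the brand token
rest  <- sub("^_", "", sub("_?brand[0-9]+", "", nm))  # drop it, tidy underscores
names(my.df) <- paste0(rest, "_", brand)
names(my.df)
# "sentiment_1_brand1" "sentiment_2_brand1" "sentiment_1_brand2"
# "sentiment_2_brand2" "rating_1_brand1"    "rating_1_brand2"
```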
I have a dataframe with multiple columns and rows. I want to export it as a .txt file with all values on the same line (i.e. one row), with individual values separated by "," and data from the rows of the df separated by ":".
w<- c(1,5)
x<- c(2,6)
y<- c(3,7)
z<- c(4,8)
df<-data.frame(w,x,y,z)
The output would look like this:
1,2,3,4:5,6,7,8:
We can combine each row with toString using apply, then paste the rows together with collapse = ":".
paste0(apply(df, 1, toString), collapse = ":")
#[1] "1, 2, 3, 4:5, 6, 7, 8"
If you want to write it to a file, use:
write.table(df, "df.csv", col.names = FALSE, row.names = FALSE, sep = ",", eol = ":")
If you want the output in R you can use do.call() and paste():
do.call(paste, c(df, sep = ",", collapse = ":"))
[1] "1,2,3,4:5,6,7,8"
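If the trailing ":" in the desired output matters, a sketch combining this with a file write ("out.txt" is a placeholder name):

```r
df <- data.frame(w = c(1, 5), x = c(2, 6), y = c(3, 7), z = c(4, 8))
# collapse rows with ":" and append the final ":" from the desired output
line <- paste0(do.call(paste, c(df, sep = ",", collapse = ":")), ":")
writeLines(line, "out.txt")
readLines("out.txt")
# [1] "1,2,3,4:5,6,7,8:"
```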
We can use str_c
library(stringr)
library(dplyr)
library(purrr)
df %>%
  reduce(str_c, sep = ",") %>%
  str_c(collapse = ":")
#[1] "1,2,3,4:5,6,7,8"
I have data that looks like this:
abc <- data.frame(a = c("[100-150)", "[150, 200)"))
I want to alter it to make it like this:
abc <- data.frame(a = c("100-149", "150-199"))
I know how to replace the brackets:
abc$a <- lapply(abc$a, gsub, pattern = "[", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = "]", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = ")", replacement = "", fixed = TRUE)
It is subtracting 1 from the number at the end that is the problem.
Is there a way to do this?
Please note this is just an example; in reality my data has a column like this with about 2000 rows.
An option with gsubfn. We extract the number (\\d+) after the - or ,, convert it to numeric, subtract 1, and paste it back with -:
library(gsubfn)
gsubfn("[-,] ?(\\d+)", ~ paste0("-", as.numeric(x) - 1) , as.character(abc$a))
#[1] "[100-149)" "[150-199)"
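gsubfn handles the decrement, but the desired output also has the brackets removed. A base R sketch that does both in one pass over the column:

```r
abc <- data.frame(a = c("[100-150)", "[150, 200)"))
x <- as.character(abc$a)
upper <- as.numeric(sub(".*[-,] ?(\\d+)\\)$", "\\1", x)) - 1  # number before ")", minus 1
lower <- sub("^\\[(\\d+).*", "\\1", x)                        # number after "["
abc$a <- paste0(lower, "-", upper)
abc$a
# [1] "100-149" "150-199"
```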
I usually use dat = read.table(filename, row.names=1) if the row names are in the first column. What is the corresponding call when the row names are in the last column of the file? I tried dat = read.table(filename, row.names=ncol(dat)), but this did not work, since the dat variable does not exist yet.
A base R option would be to use count.fields
read.table('filename.csv', sep=",",
row.names = count.fields('filename.csv', sep=",")[1], header = TRUE)
# col1 col2
#C 1 A
#F 2 B
#D 3 C
#G 4 D
data
df1 <- data.frame(col1 = 1:4, col2 = LETTERS[1:4],
col3 = c('C', 'F', 'D', 'G'), stringsAsFactors=FALSE)
write.csv(df1, 'filename.csv', quote = FALSE, row.names = FALSE)
There is a function in tibble to do this: column_to_rownames.
dat <- read.table(filename, header = TRUE)
dat <- tibble::column_to_rownames(dat, var = "target")
This provides the benefit of using the column name, so the order of columns in your source is less relevant.
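A base R equivalent that uses the last column by position instead of by name (a sketch; filename is the variable from the question):

```r
dat <- read.table(filename, header = TRUE)
rownames(dat) <- dat[[ncol(dat)]]  # last column becomes the row names
dat <- dat[-ncol(dat)]             # then drop it
```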
I would personally just cut and paste the header row into the correct position at the top. Then, speak with your data pipeline folks about why headers are appearing at the bottom of the file. If you want an R solution to this, I can offer the following code.
I don't think there is a nice way to do this using read.table or read.csv. These functions were designed with the header being on the top of the file. You could try the following:
library(readr)
df <- NULL
cols <- character()
line <- 0L
repeat {
  line <- line + 1L
  input <- read_lines(file, skip = line - 1L, n_max = 1L)  # file being your file
  if (length(input) == 0L || input == "") break            # stop at end of file
  cols <- strsplit(input, " ")[[1]]
  df <- rbind(df, cols)
}
# remove the last row, which is really just the header names
df <- as.data.frame(head(df, -1), stringsAsFactors = FALSE)
# now take the very last line read (the header) and assign the names
names(df) <- cols
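A shorter sketch of the same idea without a read loop ("file.txt" is a placeholder name; fields assumed space-separated):

```r
lines <- readLines("file.txt")
body  <- lines[-length(lines)]                     # data rows
hdr   <- strsplit(lines[length(lines)], " ")[[1]]  # trailing header line
df <- as.data.frame(do.call(rbind, strsplit(body, " ")),
                    stringsAsFactors = FALSE)
names(df) <- hdr
```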
I have a file with this format in each line:
f1,f2,f3,a1,a2,a3,...,an
Here, f1, f2, and f3 are fixed fields separated by ,, while the remainder a1,a2,...,an is a variable-length tail where n can differ per line.
How can I read this into R and conveniently store those variable-length a1 to an?
Thank you.
My file looks like the following
3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...
It is not clear what you mean by "conveniently store". If you think a data frame will suit you, try this:
df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
Edit following #Ananda Mahto's comment.
From ?read.table:
"The number of data columns is determined by looking at the first five lines of input".
Thus, if the maximum number of columns with data occurs somewhere after the first five lines, the solution above will fail.
Example of failure
# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")
# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df
Solution
# This can be solved by first counting the maximum number of columns in the text file
ncols <- max(count.fields("df", sep = ","))
# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
                 sep = ",", na.strings = "", header = FALSE, fill = TRUE,
                 col.names = paste0("f", seq_len(ncols)))
df
# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
df
#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind)
#
# Output
#
V1 V2 V3 V4
1 3 a -4 news, finance
2 2 b 1 politics
3 1 a 0
4 2 c 2 book, movie
A place to start:
dat <- readLines(file)          ## file being your file
dat_split <- strsplit(dat, ",") ## split each line on commas
df <- data.frame(
f1=sapply(dat_split, "[[", 1),
f2=sapply(dat_split, "[[", 2),
f3=sapply(dat_split, "[[", 3),
a=unlist( sapply(dat_split, function(x) {
if (length(x) <= 3) {
return(NA)
} else {
return(paste(x[4:length(x)], collapse=","))
}
}) )
)
and when you need to pull things out of a, you can split it as necessary.
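For example, to recover the individual a-values as vectors (a sketch, reusing the df built above):

```r
split_a <- strsplit(df$a, ",", fixed = TRUE)  # NA rows stay NA
split_a[[1]]
# [1] "news"    "finance"
```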