How to create specific columns out of text in R - r

Here is an example I hope you can help me with. Given that the input is a line from a txt file, I want to transform it into a table (see the desired output below) and save it as a CSV or TSV file.
I have tried with the separate function but could not get it right.
Input
"PR7 - Autres produits d'exploitation 6.9 371 667 1 389"
Desired output
Variable                              note  2020  2019  2018
PR7 - Autres produits d'exploitation  6.9   371   667   1389

I'm assuming that this badly delimited dataset is the only place where you can read your data.
For the purpose of this answer, I created an example file (called PR.txt) that contains only the following two lines.
PR6 - Blabla 10 156 3920 245
PR7 - Autres produits d'exploitation 6.9 371 667 1389
First, I create a function to parse each line of this dataset. I'm assuming here that the original file does not contain column names; in reality this is probably not the case, but the function could easily be adapted to take a first "header" line into account.
readBadlyDelimitedData <- function(x) {
  # Read the data
  dat <- read.table(text = x)
  # Get the type of each column
  whatIsIt <- sapply(dat, typeof)
  # Combine the columns that are of type "character"
  variable <- paste(dat[whatIsIt == "character"], collapse = " ")
  # Put everything in a data frame
  res <- data.frame(
    variable = variable,
    dat[, whatIsIt != "character"])
  # Change the names
  names(res)[-1] <- c("note", "Year2021", "Year2020", "Year2019")
  return(res)
}
Note that I do not give the yearly columns purely numeric names such as "2021", because naming rows or columns with numbers alone is not good practice in R.
Once I have this function, I can (l)apply it to each line of the data by combining it with readLines, and collapse all the lines with an rbind.
out <- do.call("rbind", lapply(readLines("tests/PR.txt"), readBadlyDelimitedData))
out
                              variable note Year2021 Year2020 Year2019
1                         PR6 - Blabla 10.0      156     3920      245
2 PR7 - Autres produits d'exploitation  6.9      371      667     1389
Finally, I save the result with write.csv:
write.csv(out, file = "correctlyDelimitedFile.csv")
If you can get your hands on the Excel file, a simple gdata::read.xls or openxlsx::read.xlsx would be enough to read the data.
I wish I knew how to make the script simpler... maybe a tidyr magic person would have a more elegant solution?
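For what it's worth, here is a tidyr sketch of the same idea (assuming tidyr is installed; the regex and the inline data frame are my own illustration, not taken from the original file):

```r
library(tidyr)

# the two example lines, as a one-column data frame
raw <- data.frame(line = c(
  "PR6 - Blabla 10 156 3920 245",
  "PR7 - Autres produits d'exploitation 6.9 371 667 1389"
))

# capture everything up to the last four numeric fields;
# convert = TRUE turns the numeric captures into numeric columns
out <- extract(
  raw, line,
  into    = c("variable", "note", "Year2021", "Year2020", "Year2019"),
  regex   = "^(.+?)\\s+([0-9.]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)$",
  convert = TRUE
)
```

The lazy `(.+?)` expands only far enough for the remaining four numeric groups to match, which is what lets the variable name keep its internal spaces and hyphens.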

Related

Is there any way to read a .dat file from MovieLens into RStudio

I am trying to use Import Dataset in R Studio to read ratings.dat from movielens.
Basically it has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
So I need to replace :: with : or ' or whitespace, etc. I use Notepad++; it loads the file quite fast (compared to Notepad) and can view very big files easily. However, when I do the replacement, it shows some strange characters:
"LF"
From some research here, I learned that this is \n (line feed, or line break). But I do not know why these do not show when the file is first loaded, and only appear after I do the replacement. And when I load the file into RStudio, it is still detected as "LF" rather than a line break, which causes errors when reading the data.
What is the solution for that? Thank you!
PS: I know there is Python code for converting this, but I don't want to use it. Are there any other ways?
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings.dat$",files)]) # read ratings.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
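The gsub-then-fread step can be tried without the download, on the sample lines from the question (a minimal sketch, assuming a data.table version recent enough to support the text= argument):

```r
library(data.table)

lines <- c("1::1::5::978824268",
           "1::1022::5::978300055",
           "1::1028::5::978301777")

# replace the double-colon delimiter with a tab, then parse in memory
ratings <- fread(text = gsub("::", "\t", lines), sep = "\t")
```

Since the first row is all numeric, fread detects that there is no header and assigns the default names V1 through V4.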
Alternatively (use the download code from jlhoward's answer; he also updated his code to use data.table while I wrote this, but mine's still faster/more efficient :-)
library(data.table)
# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you prbly want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))
mov
## user_id movie_id tag timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It's quite a bit faster than built-in functions.
Small improvement to @hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))
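To see what sep=":" plus select= does without downloading anything, here is a small in-memory sketch (again assuming data.table with the text= argument; the column names are the ones used in the answer above):

```r
library(data.table)

txt <- "1::122::5::838985046\n1::185::5::838983525"

# a single ':' separator turns each '::' into an empty column
# (positions 2, 4, 6), so select= keeps only the populated ones
mov <- fread(text = txt, sep = ":", select = c(1, 3, 5, 7))
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))
```

Compared with reading everything and dropping columns afterwards, select= skips the blank columns during parsing, which also saves memory on the full 10M-row file.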

Remove blank lines in txt output from R

I am trying to create a specifically formatted file to use as an input file in another software. I have been able, with the help of people here, to create a file that is almost there. Now I just need to remove some empty lines in my *.txt output file. I have tried several different approaches with gsub() but can't figure out a way. Below is an example that produces a file that shows where I'm stuck.
matsplitter <- function(M, r, c) {
  rg  <- (row(M) - 1) %/% r + 1
  cg  <- (col(M) - 1) %/% c + 1
  rci <- (rg - 1) * max(cg) + cg
  N   <- prod(dim(M)) / r / c
  cv  <- unlist(lapply(1:N, function(x) M[rci == x]))
  dim(cv) <- c(r, c, N)
  cv
}
B <- matrix(c(1:1380),ncol=5)
capture.output(matsplitter(B,3,5), file='output.txt')
write.table(gsub('\\[.*\\]', '', readLines('output.txt')),
            file = 'output.txt', row.names = FALSE, quote = FALSE)
What I need to further remove are the two blank lines between the ", , 1", ", , 2" etc. string and the matrix of numbers.
x
, , 1
1 277 553 829 1105
2 278 554 830 1106
3 279 555 831 1107
, , 2
4 280 556 832 1108
5 281 557 833 1109
6 282 558 834 1110
, , 3
7 283 559 835 1111
8 284 560 836 1112
9 285 561 837 1113
A possible solution if you are willing to go beyond gsub. I have taken the liberty of breaking the answer up into pieces for clarity (hopefully).
# read in the file created by capture.output
out = gsub('\\[.*\\]', '', readLines('output.txt'))
If you look at this object out, you will see that blocks seem to be separated by five spaces, and that the first of the two blank lines you want to get rid of is an empty string "". We get rid of the runs of spaces by means of:
out = gsub("\\s{5}","",out)
Now, after the header but in front of every block, there are two empty strings, while after every block there is only one. As we only want to exclude the empty strings in front of blocks, we use the function rle to find runs of repeated elements and exclude those.
#get indicator vector
exclvec = rep(rle(out)$lengths,rle(out)$lengths)
#remove values as indicated by exclvec
out = out[ifelse(out=="" & exclvec==2,F,T)]
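To see what the rle trick is doing, here is a toy sketch on a small character vector (my own example, not the question's file):

```r
x <- c("header", "", "", "block", "")

# run lengths, repeated so each element knows the size of its own run
r       <- rle(x)
exclvec <- rep(r$lengths, r$lengths)   # c(1, 2, 2, 1, 1)

# drop empty strings that sit in a run of length 2
x[!(x == "" & exclvec == 2)]
```

The pair of empty strings before "block" is removed, while the single trailing empty string survives, because only runs of exactly two empties match the condition.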
As I interpret your question, writing this object to file provides the desired result.
write.table(out,file='output.txt', row.names=FALSE, quote=FALSE)

R correct use of read.csv

I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk<-data.matrix(read.csv("path\to\csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column with characters in it as a factor. The '-' entries in the first row of your dataset are characters, so those columns are converted to factors. You then pass the result of read.csv into data.matrix and, as the help states, it replaces the levels of each factor with its internal codes.
Basically, you need to ensure that the columns of your data are numeric before you pass the data.frame into data.matrix.
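A minimal illustration of the factor trap (building the factor by hand, since R 4.0+ no longer creates factors by default when reading):

```r
x <- factor(c("-", "241.75", "244"))

as.numeric(x)                # internal level codes, not the original numbers
as.numeric(as.character(x))  # the actual values; '-' becomes NA (with a warning)
```

Converting through as.character first recovers the real values, which is exactly what na.strings = "-" plus colClasses = 'numeric' avoids having to do by hand.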
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
I'm no R expert, but you may consider using scan() instead, e.g.:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
I took a cut/paste of your data, put it in a file, and got this using R:
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file; for one thing, data for the header line. And the output you show seems to start at row 2713. I would check:
The format of the header line, or get rid of it and add it manually later.
That each row has exactly 6 values.
That the filename uses forward slashes and has no embedded spaces
(use the 8.3 representation as shown in my filename).
Also, if you generated your csv file from MS Excel, the internal representation for a date is a number.

Read a CSV file in R, and select each element

Sorry if the title is confusing. I can import a CSV file into R, but when I try to select one element by providing the row and column index, I get more than one element. All I want is to use this imported CSV as a data.frame from which I can select any column, row, or single cell. Can anyone give me some suggestions?
Here is the data:
SKU On Off Duration(hr) Sales
C010100100 2/13/2012 4/19/2012 17:00 1601 238
C010930200 5/3/2012 7/29/2012 0:00 2088 3
C011361100 2/13/2012 5/25/2012 22:29 2460 110
C012000204 8/13/2012 11/12/2012 11:00 2195 245
C012000205 8/13/2012 11/12/2012 0:00 2184 331
CODE:
Dat = read.table("Dat.csv",header=1,sep=',')
Dat[1,][1] #This is close to what I need but is not exactly the same
SKU
1 C010100100
Dat[1,1] # Ideally, I want to have results only with C010100100
[1] C010100100
3861 Levels: B013591100 B024481100 B028710300 B038110800 B038140800 B038170900 B038260200 B038300700 B040580700 B040590200 B040600400 B040970200 ... YB11624Q1100
Thanks!
You can convert to character to get the value as a string, and no longer as a factor:
as.character(Dat[1,1])
You have just one element, but the factor contains all levels.
Alternatively, pass the option stringsAsFactors=FALSE to read.table when you read the file, to prevent creation of factors for character values:
Dat = read.table("Dat.csv",header=1,sep=',', stringsAsFactors=FALSE )
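A quick way to see the difference, with a toy factor standing in for the SKU column:

```r
f <- factor(c("C010100100", "C010930200"))

f[1]                # prints the value together with all the levels
as.character(f[1])  # just the string "C010100100"
```

Subsetting a factor keeps the full set of levels attached, which is why printing Dat[1,1] shows all 3861 of them; as.character strips that away.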

R - pipe("pbcopy") columns not lining up with pasting

As a follow-up to a question I wrote a few days ago, I finally figured out how to copy to the clipboard to paste into other applications (read: Excel).
However, when using the function to copy and paste, the variable column headers are not lining up correctly when pasting.
Data (taken from a Flowing Data example I happened to be looking at):
data <- read.csv("http://datasets.flowingdata.com/post-data.txt")
Copy function:
write.table(file = pipe("pbcopy"), data, sep = "\t")
When loaded in, the data looks like this:
id views comments category
1 5019 148896 28 Artistic Visualization
2 1416 81374 26 Visualization
3 1416 81374 26 Featured
4 3485 80819 37 Featured
5 3485 80819 37 Mapping
6 3485 80819 37 Data Sources
There is a row number without a column variable name (1, 2, 3, 4, ...)
Using the read.table(pipe("pbpaste")) function, it will load back into R fine.
However, when I paste it into Excel, or TextEdit, the column name for the second variable will be in the first variable column name slot, like this:
id views comments category
1 5019 148896 28 Artistic Visualization
2 1416 81374 26 Visualization
3 1416 81374 26 Featured
4 3485 80819 37 Featured
5 3485 80819 37 Mapping
6 3485 80819 37 Data Sources
Which leaves the trailing column without a column name.
Is there a way to ensure the data copied to the clipboard is aligned and labeled correctly?
The row numbers do not have a column name in an R data.frame. They were not in the original dataset, but they are written to the clipboard unless you suppress them: the row.names argument of write.table defaults to TRUE, but you can override it. If you want such a column as a named column, you need to make it yourself. Try this when sending to Excel.
df$rownums <- rownames(df)
edf <- df[ c( length(df), 1:(length(df)-1))] # to get the rownums/rownames first
write.table(file = pipe("pbcopy"), edf, row.names=FALSE, sep = "\t")
You may just want to add the argument col.names=NA to your call to write.table(). It has the effect of adding an empty character string (a blank column name) to the header row for the first column.
write.table(file = pipe("pbcopy"), data, sep = "\t", col.names=NA)
To see the difference, compare these two function calls:
write.table(data[1:2,], sep="\t")
# "id" "views" "comments" "category"
# "1" 5019 148896 28 "Artistic Visualization"
# "2" 1416 81374 26 "Visualization"
write.table(data[1:2,], sep="\t", col.names=NA)
# "" "id" "views" "comments" "category"
# "1" 5019 148896 28 "Artistic Visualization"
# "2" 1416 81374 26 "Visualization"
