I am trying to use Import Dataset in RStudio to read ratings.dat from MovieLens.
The file has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
So I need to replace :: with :, ', whitespace, etc. I use Notepad++, which loads the file quite fast (compared to Notepad) and can view very big files easily. However, when I do the replacement, it shows some strange characters:
"LF"
From some research here, I learned that this is \n (line feed, i.e. a line break). But I do not know why these characters are not shown when the file is loaded and only appear after I do the replacement. When I load the file into RStudio, it still detects them as "LF" rather than as line breaks, which causes an error when reading the data.
What is the solution for that? Thank you!
PS: I know there is Python code for converting this, but I don't want to use it. Are there any other ways?
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings.dat$",files)]) # read rating.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
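If you only want to check the replace-then-parse idea before downloading anything, it can be tried on a handful of lines in base R. This is a minimal sketch, not the method above: the five lines are copied from the question, and the column names are my own assumption (the file itself has no header). When reading the real file, readLines() already strips the line terminators, so the "LF" marks that Notepad++ displays should never reach the parser.

```r
# Five sample lines copied from the question, standing in for ratings.dat
sample_lines <- c(
  "1::1::5::978824268",
  "1::1022::5::978300055",
  "1::1028::5::978301777",
  "1::1029::5::978302205",
  "1::1035::5::978301753"
)

# Replace the two-character delimiter with a tab; fixed = TRUE treats "::"
# as a literal string rather than a regular expression
tab_lines <- gsub("::", "\t", sample_lines, fixed = TRUE)

# Parse the tab-separated lines; the column names are assumed, not read
ratings_small <- read.table(
  text = tab_lines,
  sep = "\t",
  col.names = c("user_id", "movie_id", "rating", "timestamp")
)
str(ratings_small)
```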
Alternatively (this uses the download code from jlhoward, who also updated his code to stop using built-in functions and switch to data.table while I wrote this, but mine's still faster/more efficient :-)
library(data.table)
# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you probably want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "rating", "timestamp"))
mov
## user_id movie_id rating timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It's quite a bit faster than built-in functions.
Small improvement to @hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))
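To see what select= does without the 10M-row file, here is a small sketch on the five sample lines from the question. It assumes a data.table version recent enough to have the text= argument of fread; the column names are my own choice, not something stored in the data.

```r
library(data.table)

# Five sample lines from the question, standing in for ratings.dat
sample_lines <- c(
  "1::1::5::978824268",
  "1::1022::5::978300055",
  "1::1028::5::978301777",
  "1::1029::5::978302205",
  "1::1035::5::978301753"
)

# Splitting on a single ":" yields 7 columns, of which 2, 4 and 6 are empty;
# select= keeps only the odd-numbered columns at read time
mov_small <- fread(text = sample_lines, sep = ":", select = c(1, 3, 5, 7))

# Assumed, human-readable names for the four remaining columns
setnames(mov_small, c("user_id", "movie_id", "rating", "timestamp"))
mov_small
```

Dropping the empty columns while reading avoids creating them and subsetting afterwards, which is why this form is a little tidier than the `[, c(1,3,5,7), with=FALSE]` version.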
I have a file named data.json. It has the following contents:
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
In RStudio, I have installed the 'rjson' package and have the following code:
library("rjson")
myData <- fromJSON(file="data.json")
print(myData)
According to the description of the fromJSON() function, it should read the contents of the 'data.json' file into an R object, 'myData'. When I executed it, I got the following error:
Error in fromJSON(file = "data.json") :
not all data was parsed (0 chars were parsed out of a total of 3 chars)
I validated the structure of the 'data.json' file on https://jsonlint.com/. It was valid.
I searched stackoverflow.com and got the following page: Error in fromJSON("employee.json") : not all data was parsed (0 chars were parsed out of a total of 13 chars)
My program already complies with the answers given here but the 'data.json' file is still not getting parsed.
I would be grateful if you could point out what mistake I am making in the R program or JSON file as I am new to both.
Thank You.
I can confirm the error for rjson, but jsonlite::fromJSON appears to work.
jsonlite::fromJSON('foo.dat') |> as.data.frame()
# ID Name Salary StartDate Dept
# 1 1 Rick 623.3 1/1/2012 IT
# 2 2 Dan 515.2 9/23/2013 Operations
# 3 3 Michelle 611 11/15/2014 IT
# 4 4 Ryan 729 5/11/2014 HR
# 5 5 Gary 843.25 3/27/2015 Finance
# 6 6 Nina 578 5/21/2013 IT
# 7 7 Simon 632.8 7/30/2013 Operations
# 8 8 Guru 722.5 6/17/2014 Finance
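Note that every value in this file is a JSON string, so all columns come back as character. A minimal sketch of converting them afterwards, assuming jsonlite is installed; it inlines the first three records from the question rather than reading a file, and the month/day/year date format is inferred from the sample values:

```r
library(jsonlite)

# First three records from the question's data.json, inlined as a string
json_text <- '{
  "ID": ["1", "2", "3"],
  "Name": ["Rick", "Dan", "Michelle"],
  "Salary": ["623.3", "515.2", "611"],
  "StartDate": ["1/1/2012", "9/23/2013", "11/15/2014"],
  "Dept": ["IT", "Operations", "IT"]
}'

# fromJSON() returns a named list of character vectors; as.data.frame()
# lines them up as columns
employees <- as.data.frame(fromJSON(json_text))

# Convert the columns that are numbers or dates stored as strings
employees$ID <- as.integer(employees$ID)
employees$Salary <- as.numeric(employees$Salary)
employees$StartDate <- as.Date(employees$StartDate, format = "%m/%d/%Y")
str(employees)
```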
Here is an example I hope you can help me with. The input is a line from a txt file; I want to transform it into a table (see the desired output) and save it as a csv or tsv file.
I have tried with the separate functions but could not get it right.
Input
"PR7 - Autres produits d'exploitation 6.9 371 667 1 389"
Desired output
Variable                              note  2020  2019  2018
PR7 - Autres produits d'exploitation  6.9   371   667   1389
I'm assuming that this badly delimited data-set is the only place where you can read your data.
I created for the purpose of this answer an example file (that I called PR.txt) that contains only the two following lines.
PR6 - Blabla 10 156 3920 245
PR7 - Autres produits d'exploitation 6.9 371 667 1389
First I create a function to parse each line of this data-set. I'm assuming here that the original file does not contain the names of the columns. In reality, this is probably not the case, but the function could easily be adapted to take a first "header" line into account.
readBadlyDelimitedData <- function(x) {
# Read the data
dat <- read.table(text = x)
# Get the type of each column
whatIsIt <- sapply(dat, typeof)
# Combine the columns that are of type "character"
variable <- paste(dat[whatIsIt == "character"], collapse = " ")
# Put everything in a data-frame
res <- data.frame(
variable = variable,
dat[, whatIsIt != "character"])
# Change the names
names(res)[-1] <- c("note", "Year2021", "Year2020", "Year2019")
return(res)
}
Note that I do not call the columns with the yearly figure by only "numeric" names because giving rows or columns purely "numerical" names is not a good practice in R.
Once I have this function, I can (l)apply it to each line of the data by combining it with readLines, and collapse all the lines with an rbind.
out <- do.call("rbind", lapply(readLines("tests/PR.txt"), readBadlyDelimitedData))
out
variable note Year2021
1 PR6 - Blabla 10.0 156
2 PR7 - Autres produits d'exploitation 6.9 371
Year2020 Year2019
1 3920 245
2 667 1389
Finally, I save the result with write.csv:
write.csv(out, file = "correctlyDelimitedFile.csv")
If you can get your hands on the Excel file, a simple gdata::read.xls or openxlsx::read.xlsx would be enough to read the data.
I wish I knew how to make the script simpler... maybe a tidyr magic person would have a more elegant solution?
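One possibly simpler variant, in base R only, is to split each line on whitespace and treat the last few tokens as the numbers. This is a sketch under two assumptions of mine: every line ends in exactly four numeric fields, and thousands separators such as the space in "1 389" have already been removed (as in PR.txt above). The function name parse_line and the n_values argument are hypothetical.

```r
# Split one line into a text label plus its trailing numeric fields.
# Assumes the last `n_values` whitespace-separated tokens are numbers.
parse_line <- function(line, n_values = 4) {
  tokens <- strsplit(trimws(line), "\\s+")[[1]]
  n <- length(tokens)
  values <- as.numeric(tokens[(n - n_values + 1):n])
  label <- paste(tokens[seq_len(n - n_values)], collapse = " ")
  data.frame(
    variable = label,
    note     = values[1],
    Year2021 = values[2],
    Year2020 = values[3],
    Year2019 = values[4]
  )
}

parsed <- parse_line("PR7 - Autres produits d'exploitation 6.9 371 667 1389")
parsed
```

It can be applied to a whole file in the same way as before, with do.call("rbind", lapply(readLines(...), parse_line)).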
I want to load the data from a JSON file into R to make a new data frame. However, the JSON file consists of links to other data, so I can't seem to find the actual data in it. I got the JSON file from this website: https://ckan.dataplatform.nl/dataset/467dc230-20e0-4c3a-8240-dccbfc20807a/resource/531cc276-b88e-49bb-a97f-443707936a12/download/p-route-autoparkeren.json
This is the code I used.
library(rjson)
JSONList1 <- fromJSON(file = "utrecht2.json")
print(JSONList1)
JSONList1_df <- as.data.frame(JSONList1)
When I use this code, I get only 1 observation with 411 variables.
Any idea how to do this? I'm a beginner and I've never worked with JSON files.
Maybe try fromJSON from package jsonlite
library(jsonlite)
JSONList1 <- fromJSON("https://ckan.dataplatform.nl/dataset/467dc230-20e0-4c3a-8240-dccbfc20807a/resource/531cc276-b88e-49bb-a97f-443707936a12/download/p-route-autoparkeren.json")
There are several packages offering JSON importing abilities. If I use the one I am involved with, then the resulting data appears to contain a data.frame as the first list element.
d <- RcppSimdJson::fload("https://ckan.dataplatform.nl/dataset/467dc230-20e0-4c3a-8240-dccbfc20807a/resource/531cc276-b88e-49bb-a97f-443707936a12/download/p-route-autoparkeren.json")
> class(d)
[1] "list"
> class(d[[1]])
[1] "data.frame"
>
> head(d[[1]])
dynamicDataUrl
1 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/8d85bbdb-8bbd-4a24-b35f-85f21186ec04
2 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/21b0388a-56f7-4cba-8fd3-4a1c914f5fe2
3 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/45434989-3252-4c85-8731-c856b02c390c
4 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/9064b206-7e62-402d-ae62-f25a0e47571b
5 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/5829fb06-ee4a-4762-946c-ed6209edf7d5
6 http://opendata.technolution.nl/opendata/parkingdata/v1/dynamic/e4da517a-ef32-426d-821c-96e29ac5ac80
staticDataUrl
1 http://opendata.technolution.nl/opendata/parkingdata/v1/static/8d85bbdb-8bbd-4a24-b35f-85f21186ec04
2 http://opendata.technolution.nl/opendata/parkingdata/v1/static/21b0388a-56f7-4cba-8fd3-4a1c914f5fe2
3 http://opendata.technolution.nl/opendata/parkingdata/v1/static/45434989-3252-4c85-8731-c856b02c390c
4 http://opendata.technolution.nl/opendata/parkingdata/v1/static/9064b206-7e62-402d-ae62-f25a0e47571b
5 http://opendata.technolution.nl/opendata/parkingdata/v1/static/5829fb06-ee4a-4762-946c-ed6209edf7d5
6 http://opendata.technolution.nl/opendata/parkingdata/v1/static/e4da517a-ef32-426d-821c-96e29ac5ac80
limitedAccess identifier name
1 FALSE 8d85bbdb-8bbd-4a24-b35f-85f21186ec04 P06 - Sluisstraat
2 FALSE 21b0388a-56f7-4cba-8fd3-4a1c914f5fe2 3 - Burcht
3 FALSE 45434989-3252-4c85-8731-c856b02c390c P01 - Stationsplein
4 FALSE 9064b206-7e62-402d-ae62-f25a0e47571b Jaarbeurs P3 - Jaarbeurs P3
5 FALSE 5829fb06-ee4a-4762-946c-ed6209edf7d5 P03 - Dek Stadspoort
6 FALSE e4da517a-ef32-426d-821c-96e29ac5ac80 PG-Pieter Vreedeplein
locationForDisplay
1 NA
2 WGS84, 52.4387428557465, 4.82805132865906
3 WGS84, 52.2573226613971, 6.16240739822388
4 WGS84, 52.0854991774024, 5.10619640350342
5 WGS84, 52.256324421386, 6.15569114685059
6 WGS84, 51.5582297848141, 5.08894979953766
>
I would expect this to be similar for the other ones.
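To illustrate why the data frame ends up one level down, here is a small offline sketch with jsonlite. The JSON string is a made-up miniature that mimics the shape suggested by the output above; the top-level key name "parkingFacilities" is my assumption, while the identifiers and names are copied from that output.

```r
library(jsonlite)

# Hypothetical miniature of the file: one top-level key holding an array
# of records (the key name is assumed, not confirmed by the source)
json_text <- '{
  "parkingFacilities": [
    {"identifier": "8d85bbdb-8bbd-4a24-b35f-85f21186ec04",
     "name": "P06 - Sluisstraat", "limitedAccess": false},
    {"identifier": "21b0388a-56f7-4cba-8fd3-4a1c914f5fe2",
     "name": "3 - Burcht", "limitedAccess": false}
  ]
}'

parsed <- fromJSON(json_text)

# Inspect the top level first: it is a list, not a data frame
str(parsed, max.level = 1)

# The array of records is simplified to a data frame inside that list
facilities <- parsed[[1]]
facilities$name
```

Calling as.data.frame() on the outer list is what flattens everything into a single row; picking out the inner element first avoids that.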
I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk<-data.matrix(read.csv("path\to\csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column with characters in it as a factor (this is the default in R versions before 4.0, where stringsAsFactors = TRUE). The '-' entries in the first row of your dataset are characters, so each of those columns is converted to a factor. You then pass the result of read.csv into data.matrix, and as the help states, it replaces the levels of the factor with its internal codes.
Basically, you need to ensure that the columns of your data are numeric before you pass the data.frame into data.matrix.
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
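The effect can be reproduced on the four rows from the question. This is a small sketch that reads them from a character vector instead of a file; the header line is added by me to match the column names in the question's output, and stringsAsFactors = TRUE is set explicitly so the behaviour is the same in newer R versions.

```r
# The four rows from the question, with a header line added for readability
csv_lines <- c(
  "Date,Open,High,Low,Close,Volume",
  "40900,-,-,-,241.75,0",
  "40905,244,245.79,241.25,244,22114",
  "40906,244,246.79,243.6,245.5,18024",
  "40907,246,248.5,246,247,60859"
)

# Columns containing "-" become factors, and data.matrix() returns the
# factor codes (1, 2, 2, 3 for Open) instead of the prices
bad <- data.matrix(read.csv(text = csv_lines, stringsAsFactors = TRUE))

# Declaring "-" as missing lets every column be read as numeric
good <- data.matrix(read.csv(text = csv_lines, na.strings = "-",
                             colClasses = "numeric"))

bad[, "Open"]
good[, "Open"]
```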
I'm no R expert, but you may consider using scan() instead, eg:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
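A quick way to try this without creating foo.csv is scan()'s text= argument. The two data lines below are made up for illustration:

```r
# Two comma-separated lines standing in for foo.csv (made-up values)
xy <- scan(
  text = c("1,2", "3,4"),
  what = list(x = numeric(), y = numeric()),
  sep = ",",
  quiet = TRUE
)

# scan() returns a list with one numeric vector per column
xy$x
xy$y
```

Because `what` declares both columns as numeric, scan() stops with an error on a value like "-" instead of silently converting the column, unless you also pass na.strings = "-".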
I took a cut/paste of your data, put it in a file and I get this using 'R'
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file, for one thing, data for the header line. And the output you show seems to start with row 2713. I would check:
The format of the header line, or get rid of it and add it manually later.
That each row has exactly 6 values.
That the filename uses forward slashes and has no embedded spaces
(use the 8.3 representation as shown in my filename).
Also, if you generated your csv file from MS Excel, the internal representation for a date is a number.
>titletool<-read.csv("TotalCSVData.csv",header=FALSE,sep=",")
> class(titletool)
[1] "data.frame"
>titletool[1,1]
[1] Experiment name : CONTROL DB AD_1
>t<-titletool[1,1]
>t
[1] Experiment name : CONTROL DB AD_1
>class(t)
[1] "character"
Now I want to create an object (a vector) named "Experiment name : CONTROL DB AD_1" or, even better if possible, CONTROL DB AD_1.
Thank you
Use assign:
varname <- "Experiment name : CONTROL DB AD_1"
assign(varname, 3.14158)
get("Experiment name : CONTROL DB AD_1")
[1] 3.14158
And you can use a regular expression and sub or gsub to remove some text from a string:
cleanVarname <- sub("Experiment name : ", "", varname)
assign(cleanVarname, 42)
get("CONTROL DB AD_1")
[1] 42
But let me warn you this is an unusual thing to do.
Here be dragons.
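A less dragon-prone alternative is to keep the values in a named list instead of creating variables with assign. This is only a sketch: the list name experiments and the numeric values are made up for illustration.

```r
varname <- "Experiment name : CONTROL DB AD_1"

# Strip the fixed prefix to get the shorter name
clean_name <- sub("^Experiment name : ", "", varname)

# Store the data under that name in a list; any string is a valid list
# name, so no backticks or get() are needed later
experiments <- list()
experiments[[clean_name]] <- c(1.2, 3.4, 5.6)

names(experiments)
experiments[["CONTROL DB AD_1"]]
```

With several experiments in one list, lapply() and sapply() can then process all of them in one call.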
If I understand correctly, you have a bunch of CSV files, each with multiple experiments in them, named in the pattern "Experiment ...". You now want to read each of these "experiments" into R in an efficient way.
Here's a not-so-pretty (but not-so-ugly either) function that might get you started in the right direction.
What the function basically does is read in the CSV, identify the line numbers where each new experiment starts, grabs the names of the experiments, then does a loop to fill in a list with the separate data frames. It doesn't really bother making "R-friendly" names though, and I've decided to leave the output in a list, because as Andrie pointed out, "R has great tools for working with lists."
read.funkyfile = function(funkyfile, expression, ...) {
temp = readLines(funkyfile)
temp.loc = grep(expression, temp)
temp.loc = c(temp.loc, length(temp)+1)
temp.nam = gsub("[[:punct:]]", "",
grep(expression, temp, value=TRUE))
temp.out = vector("list")
for (i in 1:length(temp.nam)) {
temp.out[[i]] = read.csv(textConnection(
temp[seq(from = temp.loc[i]+1,
to = temp.loc[i+1]-1)]),
...)
names(temp.out)[i] = temp.nam[i]
}
temp.out
}
Here is an example CSV file. Copy and paste it into a text editor and save it as "funkyfile1.csv" in the current working directory. (Or, read it in from Dropbox: http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv)
"Experiment Name: Here Be",,
1,2,3
4,5,6
7,8,9
"Experiment Name: The Dragons",,
10,11,12
13,14,15
16,17,18
Here is a second CSV. Again, copy-paste and save it as "funkyfile2.csv" in your current working directory. (Or, read it in from Dropbox: http://dl.dropbox.com/u/2556524/testing/funkyfile2.csv)
"Promises: I vow to",,
"H1","H2","H3"
19,20,21
22,23,24
25,26,27
"Promises: Slay the dragon",,
"H1","H2","H3"
28,29,30
31,32,33
34,35,36
Notice that funkyfile1 has no column names, while funkyfile2 does. That's what the ... argument in the function is for: to specify header=TRUE or header=FALSE. Also the "expression" identifying each new set of data is "Promises" in funkyfile2.
Now, use the function:
read.funkyfile("funkyfile1.csv", "Experiment", header=FALSE)
# read.funkyfile("http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv",
# "Experiment", header=FALSE) # Uncomment to load remotely
# $`Experiment Name Here Be`
# V1 V2 V3
# 1 1 2 3
# 2 4 5 6
# 3 7 8 9
#
# $`Experiment Name The Dragons`
# V1 V2 V3
# 1 10 11 12
# 2 13 14 15
# 3 16 17 18
read.funkyfile("funkyfile2.csv", "Promises", header=TRUE)
# read.funkyfile("http://dl.dropbox.com/u/2556524/testing/funkyfile2.csv",
# "Promises", header=TRUE) # Uncomment to load remotely
# $`Promises I vow to`
# H1 H2 H3
# 1 19 20 21
# 2 22 23 24
# 3 25 26 27
#
# $`Promises Slay the dragon`
# H1 H2 H3
# 1 28 29 30
# 2 31 32 33
# 3 34 35 36
Go get those dragons.
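For comparison, the same splitting can be sketched without an explicit loop by numbering the blocks with cumsum(). This is a variation on the function above, not a replacement for it: it uses the funkyfile1 lines inline instead of reading a file, and it assumes every marker line is followed by at least one data line.

```r
# Contents of funkyfile1.csv, inlined as a character vector
funky_lines <- c(
  '"Experiment Name: Here Be",,',
  "1,2,3", "4,5,6", "7,8,9",
  '"Experiment Name: The Dragons",,',
  "10,11,12", "13,14,15", "16,17,18"
)

# TRUE on the lines that start a new experiment
is_marker <- grepl("Experiment", funky_lines)

# cumsum() gives every line the index of the experiment it belongs to
block_id <- cumsum(is_marker)

# Group the data lines by experiment and parse each group separately
blocks <- split(funky_lines[!is_marker], block_id[!is_marker])
tables <- lapply(blocks, function(b) read.csv(text = b, header = FALSE))

# Name each data frame after its marker line, with punctuation removed
names(tables) <- gsub("[[:punct:]]", "", funky_lines[is_marker])
tables
```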
Update
If your data are all in the same format, you can use the lapply solution mentioned by Andrie along with this function. Just make a list of the CSVs that you want to load, as below. Note that, the way the function is currently written, the files all need to use the same "expression" and other arguments.
temp = list("http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv",
"http://dl.dropbox.com/u/2556524/testing/funkyfile3.csv")
lapply(temp, read.funkyfile, "Experiment", header=FALSE)