I have several ASCII files I need to import into R with return data for different asset classes. The structure of the ASCII files is as follows (with two rows of sample data):
How can I import this? I wasn't successful with read.table, but I would like to have it in a data.frame format.
<Security Name> <Ticker> <Per> <Date> <Close>
Test Description,Test,D,19700101,1.0000
Test Description,Test,D,19700102,1.5
If you really want to force the column names into R, you could use something like this:
# Data
dat <- read.csv("/path/to/data.dat", header = FALSE, skip = 1)
dat
V1 V2 V3 V4 V5
1 Test Description Test D 19700101 1.0
2 Test Description Test D 19700102 1.5
# Column names
dat.names <- readLines("/path/to/data.dat", n = 1)
# strip the outer angle brackets, then split on the "> <" between fields
names(dat) <- unlist(strsplit(gsub("^<|>$", "", dat.names), "> <"))
dat
Security Name Ticker Per Date Close
1 Test Description Test D 19700101 1.0
2 Test Description Test D 19700102 1.5
Although I think there might be better solutions, e.g. manually adding the header...
You can easily read this data using read.csv. Since your column names are not comma separated, you will need to use the header = FALSE argument and then add the names once the data is in R, or you can manually edit the data before reading it by omitting the <> characters and adding a comma between each column name.
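A minimal sketch of that first option, assuming the same file path as above:
dat <- read.csv("/path/to/data.dat", header = FALSE, skip = 1)
names(dat) <- c("Security Name", "Ticker", "Per", "Date", "Close")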
Related
Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have column names. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least this would be very practical for me.
So what I tried is this: I set the working directory to the corresponding folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the column names of every file to "Wavenumber" and the corresponding file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. And then I could probably do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
colnames(i) = c("Wavenumber", deparse(substitute(i)))
d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error:
"Error in `colnames<-`(`*tmp*`, value = c("Wavenumber", deparse(substitute(i)))) : attempt to set 'colnames' on an object with less than two dimensions"
So I tried running it without the colnames() line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
  # get the csv full paths; (?i) makes the match case-insensitive
  list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
  # create a named vector: you need it to assign ids in the next step,
  # and remove the file extension to get clean column names
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  # read the files one by one (they have no header row) and bind them into one df with an id column
  map_dfr(read.csv, header = FALSE, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
  # pivot to create one column for each .id
  pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not changing the working directory; I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated with the names of the vector you're looping over (that's why we added the names with set_names).
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
At the end you will have a dataframe with as many rows as there are Wavenumber values and as many columns as the number of CSV files plus one (the Wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 6 x 3
#>   Wavenumber file1 file2
#>        <dbl> <dbl> <dbl>
#> 1       399.  77.5  77.5
#> 2       401.  77.8  77.8
#> 3       403.  78.1  78.1
#> 4       405.  78.4  78.4
#> 5       407.  78.4  78.4
#> 6       409.  78.2  78.2
Thanks a lot to @Konrad Rudolph for the suggestions!!
No need for a loop here; simply use lapply.
First set your working directory to the file location:
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$")
theData_list <- lapply(files_to_upload, read.csv, header = FALSE)
C_Sycamore <- bind_rows(theData_list)
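If you also want to keep track of which file each row came from, you can name the list and let bind_rows add an id column (a small optional tweak to the code above):
names(theData_list) <- files_to_upload
C_Sycamore <- bind_rows(theData_list, .id = "source_file")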
I have a csv file with a variable (id). In Excel, when I check the format of the cells, some cells are type General and some are Scientific:
# id
# 1 ge189839898 #general format cell in excel
# 2 we7267178 #general format cell in excel
# 3 2.8E+12 #scientific format cell in excel
When I read the file into R using read_csv, it thinks that the column is character (which it is and what I want) but it means 2.8E+12 is also a character.
options(digits = 22, scipen = 9999)
library(tidyverse)
dfcsv <- read_csv("file.csv")
#where dfcsv looks like:
dfcsv <- data.frame(id = c("ge189839898",
"we7267178",
"2.8E+12"))
dfcsv
# id
# 1 ge189839898
# 2 we7267178
# 3 2.8E+12
Is there a way to automatically read in the csv so that variables with mixed types are correctly identified, i.e. it would return a character variable but with scientific notation expanded:
# id
# 1 ge189839898
# 2 we7267178
# 3 2800000000000
I don't think guess_max is what I am after here. I would also prefer not to use grep/sprintf type solutions (if possible) as I think that is trying to fix a problem I shouldn't have. I found these problematic ids by chance, so I would like an automated way of doing this at the reading-in stage.
The cleanest solution would probably be to go into the csv file and make the conversion there, but I want to do it through R.
Thanks
id <- c("ge189839898", "we7267178", "2.8E+12")
func <- function(x) {
  # values that parse as numbers become numeric, everything else NA
  poss_num <- suppressWarnings(as.numeric(x))
  isna <- is.na(poss_num)
  # rewrite the parseable values without scientific notation
  x[!isna] <- format(poss_num[!isna], scientific = FALSE)
  x
}
func(id)
# [1] "ge189839898" "we7267178" "2800000000000"
I am trying to import a csv file into R-Studio. The columns are separated by a comma, but the problem is that one column contains a string, and this string sometimes consists only of letters and sometimes contains semicolons (like "abcdefg33;asbfsk2ala;shcjd22l"). In any case this string should not be separated; the semicolons are not separators.
What happens is that for the lines where this column contains semicolons, nothing is separated. The other lines work fine.
The result looks like this:
Column1 Column2 Column3
a 12 abc12
b 222 bbbb222
c,333,abcdefg33;asbfsk2ala;shcjd22l
d 282 ddbb232
To import the data I tried using this code, but in both cases I get the result above.
data <- read.csv("Test.csv")
and
data <- read.csv("Test.csv", sep = ",", strip.white = TRUE)
Does anybody know how I can fix it?
Thank you!
I can simulate your result only if I explicitly add the double quotes in the csv file (e.g. with Notepad++):
a,12,1bc12
b,222,bbbb222
"c,333,abcdefg33;asbfsk2ala;shcjd22l"
d,282,ddbb232
In this case the resulting data frame looks like yours:
> data
V1 V2 V3
1 a 12 1bc12
2 b 222 bbbb222
3 c,333,abcdefg33;asbfsk2ala;shcjd22l NA
4 d 282 ddbb232
My suggestion would be to ensure that your csv file does not contain the quotes.
Otherwise, you could use readLines to read the object line by line and then use e.g. regex to get rid of the quotes.
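That could look like this (a sketch, assuming the same Test.csv as in the question):
raw <- readLines("Test.csv")
# drop the stray double quotes, then parse the cleaned text
data <- read.csv(text = gsub('"', "", raw, fixed = TRUE), header = FALSE)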
fread from data.table may help you:
library(data.table)
data4 <- fread("data_62871591.csv", sep = ",", quote = "")
Reads this file as follows:
> data4
V1 V2 V3
1: a 12 1bc12
2: b 222 bbbb222
3: "c 333 abcdefg33;asbfsk2ala;shcjd22l"
4: d 282 ddbb232
And as you can see there is still some post processing required to get rid of the quotes on row 3, columns V1 and V3.
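That cleanup could be done in place with data.table's := update syntax (a sketch on the data4 object above):
# strip the leading quote left in V1 and the trailing quote left in V3
data4[, V1 := sub('^"', "", V1)]
data4[, V3 := sub('"$', "", V3)]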
How do I keep every fifth row (deleting all the others) in an Excel file? For example, I have a starting file like this:
07/12/1989 106,9
08/12/1989 106,05
12/12/1989 103,1
13/12/1989 106,5
14/12/1989 104,75
15/12/1989 105,6
18/12/1989 104,5
19/12/1989 106,2
20/12/1989 106,5
21/12/1989 107,5
22/12/1989 109,8
and I would like the result:
07/12/1989 106,9
15/12/1989 105,6
22/12/1989 109,8
Try this:
Step 1: Read the Excel file into R using read.xlsx
Step 2: Generate the sequence of row positions and then retrieve rows based on it
indexes <- seq(1, nrow(df), 5) # row positions 1, 6, 11, ...
df[indexes, ] # retrieve only those rows
Output:
V1 V2
1 07/12/1989 106,9
6 15/12/1989 105,6
11 22/12/1989 109,8
Step 3: Store the result in an Excel file using write.xlsx
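Putting the three steps together (a sketch assuming the openxlsx package and hypothetical file names input.xlsx and output.xlsx):
library(openxlsx)
df <- read.xlsx("input.xlsx", colNames = FALSE) # Step 1: read
indexes <- seq(1, nrow(df), 5)                  # Step 2: every fifth row, starting at row 1
write.xlsx(df[indexes, ], "output.xlsx")        # Step 3: write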
Let's assume you have this dataset:
dt <- data.frame(ID = LETTERS, stringsAsFactors = FALSE)
Then you can do:
as.data.frame(dt[1:nrow(dt) %% 5 == 1, ])
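As a quick sanity check, on the 26 letters the filter keeps rows 1, 6, 11, 16, 21 and 26:
dt[1:nrow(dt) %% 5 == 1, ]
# [1] "A" "F" "K" "P" "U" "Z"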
I've been trying to import a .csv file (comma separated); however, one column is in JSON format. This causes problems when trying to import the data as a dataframe in one go. I've been trying read.table and read.csv, but I cannot find the right solution (or similar questions on Stack).
Is there an easy way to load the dataframe and keep the one JSON column as a string column? e.g. "[{"..." : "..." , "...": "..."}]" Basically everything between the "[{ }]" should end up in one column in the final dataframe.
It's extra challenging (for me) since the ',' is present in the JSON column and should not be treated as a separator there, but it should be for the rest of the columns.
Desired output:
df =
V1 V2 V3 JSONCOLUMN
x y z "[{"..." : "..." , "...": "..."}]"
Here is a somewhat hacky way:
# Read csv
df <- read.csv(text =
'"Date", "Time", "Number", "Fun", "JSON_COLUMN"
"2-2-1900", "14:09:56", 4, TRUE, "[{"message":"nothing","description": "hello", "otherField": "ciao"}]"')

# Add double quotes back around keys and values
df$JSON_COLUMN <- gsub("(\\w+):(\\s*)(\\w+)", "\"\\1\":\\2\"\\3\"", df$JSON_COLUMN)
df
#       Date     Time Number  Fun
# 1 2-2-1900 14:09:56      4 TRUE
#                                                           JSON_COLUMN
# 1 [{"message":"nothing","description": "hello", "otherField": "ciao"}]
Explanation: Since read.csv strips double quotes from the JSON string, we add them back in using a regular expression.
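Once the quotes are restored, the column is valid JSON and can be parsed, for example with the jsonlite package (an assumption; any JSON parser would do):
library(jsonlite)
fromJSON(df$JSON_COLUMN[1])
#   message description otherField
# 1 nothing       hello       ciao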