Hi, I was trying to read a CSV file and I would like to get a vector. The inside of the file looks like:
head filetoread.csv
0610010K14Rik,0610011F06Rik,1110032F04Rik,1110034G24Rik,1500011B03Rik,1700019L03Rik,1700021K19Rik, blah,blah,...
In an R session:
c <- read.csv("filetoread.csv")
> c
[1] X0610010K14Rik X0610011F06Rik X1110032F04Rik
...
> str(c)
'data.frame': 0 obs. of 2840 variables:
$ X0610010K14Rik : logi
$ X0610011F06Rik : logi
$ X1110032F04Rik : logi
$ X1110034G24Rik : logi
...
but I want something like:
> c
[1] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik" ...
str(c)
chr [1:2840] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik"...
We can use scan:
scan("filetoread.csv", sep=',', what = "", quiet = TRUE)
#[1] "0610010K14Rik" "0610011F06Rik" "1110032F04Rik" "1110034G24Rik"
#[5] "1500011B03Rik" "1700019L03Rik" "1700021K19Rik" " blah" "blah"
I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row contains the column names, but there is an inconsistency in how the lines end, which causes R to import the data improperly, shifting the data one column to the left relative to the headers.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings makes R think that the records have one more column than the header, so it treats the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign at the end of the header line the problem is solved, but this is infeasible because the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names.
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of the header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove the $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), "\n", file=paste0("fixed_", file_path))
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
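An alternative sketch, if you would rather not rewrite the files: read the header line and the data rows separately, then drop the extra empty column that the trailing $ creates on the data rows. This uses the dummy file from above; the column count is taken from the header, so it should carry over to files whose headers differ.
# read just the header row as a character vector
hdr <- scan("deleteme.txt", what = "", sep = "$", nlines = 1, quiet = TRUE)
# read the data rows without a header; the trailing $ yields one extra, empty column
dat <- read.table("deleteme.txt", skip = 1, header = FALSE, sep = "$",
                  comment.char = "", quote = "", stringsAsFactors = FALSE)
# keep only as many columns as there are header fields, then name them
dat <- dat[, seq_along(hdr)]
names(dat) <- hdr
Wrapped in a function, this could be applied to the hundreds of files with lapply over list.files().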
I am using R and I have this string as a row, and I need to split it into columns:
'id': 1050442590754103297, 'id_str': '1050442590754103297', 'name':
'ام رودينا ', 'screen_name': 'uclkGkQ5', 'location': None,
'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات
الخاصه', 'translator_type': 'none', 'protected': False,
'verified': False, 'followers_count': 1567, 'friends_count': 4019,
'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279,
'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None,
'time_zone': None, 'geo_enabled': False, 'lang': 'ar',
'contributors_enabled': False, 'is_translator': False,
'profile_background_color': 'F5F8FA', 'profile_background_image_url':
'', 'profile_background_image_url_https': '',
'profile_background_tile': False, 'profile_link_color': '1DA1F2',
'profile_sidebar_border_color': 'C0DEED',
'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color':
'333333', 'profile_use_background_image': True, 'profile_image_url':
'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg',
'profile_image_url_https':
'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg',
'profile_banner_url':
'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015',
'default_profile': True, 'default_profile_image': False, 'following':
None, 'follow_request_sent': None, 'notifications': None
I tried this code and it worked, but I need to specify the number of columns I want and also rename the columns at the end, so it is difficult and takes time:
d <- str_split_fixed(try$user, ",", 4)
The result I got (without column names) is:
'id': 1050442590754103297 'id_str': '1050442590754103297' 'name': 'ام رودينا <U+267F>'
and the fourth column has the rest of the string:
'screen_name': 'uclkGkQ5', 'location': None, 'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات الخاصه<U+267F>', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1567, 'friends_count': 4019, 'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279, 'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ar', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None
I need code that splits the rows on commas and makes the column name the word before the colon (:), as shown here:
id id_str name screen_name
1050442590754103297 1050442590754103297 ام رودينا \u267f uclkGkQ5
and the same for the rest of the string. I hope you understand me, and thank you.
This mimics JSON but is not legal JSON. One method (if the assumptions below hold) would be to convert it to JSON and parse it from there.
Up front: I have an issue in my R session where the Arabic letters are not stored correctly in the string. This happens before the calls to gsub, etc., so I believe the code may work just fine on your machine; you will see empty strings in my output as a consequence. (Rather than hunting this down locally, I wanted to give you a shot at the code first.)
Assumptions:
there are no embedded double-quotes
the True/False/None literal strings never appear embedded within text other than as the literal values (e.g., 'screen_name': 'Is None' does not happen)
you are willing to find NULL within your data where there previously was :None
Important conversions:
True and False to lowercase
None to null
encapsulate the entire thing in a dictionary with surrounding { and }
convert all single-quotes to double-quotes
You can use the magrittr pipe here for readability, or just nest all of the functions (barely faster):
out <- jsonlite::fromJSON(
paste0("{", gsub(":\\s*True\\s*(,?)", ":true\\1",
gsub(":\\s*False\\s*(,?)", ":false\\1",
gsub(":\\s*None\\s*(,?)", ":null\\1",
gsub("'", '"', s)))),
"}"))
# or
library(magrittr)
out <- s %>%
gsub(":\\s*True\\s*(,?)", ":true\\1", .) %>%
gsub(":\\s*False\\s*(,?)", ":false\\1", .) %>%
gsub(":\\s*None\\s*(,?)", ":null\\1", .) %>%
gsub("'", '"', .) %>%
paste0("{", ., "}") %>%
jsonlite::fromJSON(.)
The results (compacted with str):
str(out)
# List of 39
# $ id : num 1.05e+18
# $ id_str : chr "1050442590754103297"
# $ name : chr " "
# $ screen_name : chr "uclkGkQ5"
# $ location : NULL
# $ url : NULL
# $ description : chr "<U+200F><U+200F> "
# $ translator_type : chr "none"
# $ protected : logi FALSE
# $ verified : logi FALSE
# $ followers_count : int 1567
# $ friends_count : int 4019
# $ listed_count : int 0
# $ favourites_count : int 6669
# $ statuses_count : int 9279
# $ created_at : chr "Thu Oct 11 17:46:44 +0000 2018"
# $ utc_offset : NULL
# $ time_zone : NULL
# $ geo_enabled : logi FALSE
# $ lang : chr "ar"
# $ contributors_enabled : logi FALSE
# $ is_translator : logi FALSE
# $ profile_background_color : chr "F5F8FA"
# $ profile_background_image_url : chr ""
# $ profile_background_image_url_https: chr ""
# $ profile_background_tile : logi FALSE
# $ profile_link_color : chr "1DA1F2"
# $ profile_sidebar_border_color : chr "C0DEED"
# $ profile_sidebar_fill_color : chr "DDEEF6"
# $ profile_text_color : chr "333333"
# $ profile_use_background_image : logi TRUE
# $ profile_image_url : chr "http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
# $ profile_image_url_https : chr "https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
# $ profile_banner_url : chr "https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015"
# $ default_profile : logi TRUE
# $ default_profile_image : logi FALSE
# $ following : NULL
# $ follow_request_sent : NULL
# $ notifications : NULL
Notes:
reminder: empty spaces here are a problem with my local "locale" setting, not necessarily the process here (sorry, I don't often process non-ASCII on my console)
the large-integer id field is converted to numeric; it cannot be stored as an integer without the bit64 package. I don't know if this will be a problem for you
I tried to be "careful" with the replacements and very specific with the regex patterns, looking for a preceding colon, no surrounding quotes, and an optional trailing comma (because the last None is not followed by anything); being as specific as possible mitigates the risk of incorrectly replacing these literals within a string
I stress again the assumption of only single quotes ... if double quotes are present, they will mess up the whole thing, and I haven't put thought into dealing with them
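If you then want the result as a one-row data frame with those names as the columns (as in your desired output), one possible follow-up, assuming out is the parsed list from above, is to turn the NULL elements into NA and call as.data.frame:
# NULLs would be dropped by as.data.frame, so turn them into NA first
out[vapply(out, is.null, logical(1))] <- NA
# one row, one column per field
df <- as.data.frame(out, stringsAsFactors = FALSE)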
I have a rather simple data frame
str(match)
'data.frame': 261 obs. of 2 variables:
$ country: chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
$ match : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
If I write this to a .csv file by running a single line, everything works fine.
write.csv(match.df, file = "match.csv")
However, when using knitr to produce a .pdf I (of course) use chunks.
<<worldmapmatch>>=
exo$match <- is.na(match(exo$iso2c, wrld_simpl@data$ISO2))
match <- exo[, c("country", "match")]
match.df <- data.frame(match)
head(match.df)
str(match.df)
write.csv(match.df, file = "match.csv")
save(exo, file = "exo.RData")
@
In this case I receive an error message.
Dimension too large.
The error is reported for the line that states "write..."
Any clues?