Data loss during read.csv in R - r

I have a .csv file to be imported into R, which has more than 1K observations. However, when I used the read.csv function as usual, the imported file only has 21 observations. This is strange. I've never seen this before.
t <- read.csv("E:\\AH1_09182014.CSV",header=T, colClasses=c(rep("character",3),rep("numeric",22)),na.string=c("null","NaN",""),stringsAsFactors=FALSE)
Can anyone help me figure out the problem? I am giving a link to my data file:
https://drive.google.com/file/d/0B86_a8ltyoL3TzBza0x1VTd2OTQ/edit?usp=sharing

You have some messy characters in your data--things like embedded control characters.
A workaround is to read the file in binary mode, and use read.csv on the text file read in.
This answer proposes a basic function to do those steps.
The function looks like this:
sReadLines <- function(fnam) {
f <- file(fnam, "rb")
res <- readLines(f)
close(f)
res
}
You can use it as follows:
temp <- read.csv(text = sReadLines("~/Downloads/AH1_09182014.CSV"),
stringsAsFactors = FALSE)
Have all lines been read in?
dim(temp)
# [1] 1449 25
Where is that problem line?
unlist(temp[21, ], use.names = FALSE)
# [1] "A-H Log 1" "09/18/2014" "0:19:00" "7.866" "255" "0.009"
# [7] "525" "7" "4468" "76" "4576.76" "20"
# [13] "71" "19" "77" "1222" "33857" "-3382"
# [19] "26\032)" "18.30" "84.80" "991.43" "23713.90" "0.85"
# [25] "10.54"
^^ see item [19] above.
Because of this, you won't be able to specify all of your column types up front--unless you clean the CSV first.

Related

Cannot iterate through a list while I iterate through a list of lists

The input is a list of lists. Please see below. The file names is a list containing as many names as there are lists in the list (name1, name2, name3).
Each name is appended to the path: path/name1 - path/name2 - path/name3
The program iterates through the list containing the paths as it iterates through the list of lists and prints the paths with their file names. I would expect for the output to be path/name1 - path/name2 - path/name3. However I get the output below. Please see OUTPUT after INPUT
INPUT
[[1]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt" "JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
[[2]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[[3]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
OUTPUT
I would expect for the output to be path/nam1 - path/name2 - path/name3
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name1.tsv",
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name2.tsv",
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name3.tsv".
However I get the output below:
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name1.tsv"
I cannot understand why I cannot iterate through the list of paths with the file name while iterating through the list of lists. I hope this helps to clarify the problem. Could anyone help with this?
I have analyzed each statement using printing and every thing works fine except for the output of the code below
for (i in 1:length(lc)) {
for (j in 1:length(lc[[i]])) { # fetch and read files
if (j==1) {
newFile <- paste(dataFnsDir, lc[[i]][j], sep="/")
newFile <- tryCatch(read.delim(newFile, header = TRUE, sep = '/'), error=function(e) NULL)
newFile<- tryCatch(newFile, error=function(e) data.frame())
print(tmpFn[i])
} else {
newFile <- paste(dataFnsDir, lc[[i]][j], sep="/")
newFile <- tryCatch(read.delim(newFilei, header = TRUE, sep = '/'), error=function(e) NULL)
newFile <- tryCatch(newFile, error=function(e) data.frame())
newFile <- dplyr::bind_rows(newFile, newFile)
print(tmpFn[i])
}
}
}
There's no need to use nested loop. try this:
# sample data
dataFnsDir <- "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/"
lc <- list()
lc[[1]] <- c("150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt","160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
, "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt" ,"JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
)
lc[[2]] <- c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" , "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
)
lc[[3]] <- c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
)
# actual code
lc.path.v <- paste0(dataFnsDir,unlist(lc))
# maybe this is what you want?
lc.path.v
#> [1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
#> [3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
#> [4] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
#> [5] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [6] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
#> [7] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
#> [8] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [9] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
If you want to read all of them and combine them together, try this(it may not work because I don't know what the data looks like):
lc.alldf <- lapply(lc.path, read.delim, header = TRUE, sep = "/")
lc.onedf <- dplyr::bind_rows(lc.alldf)
Edit:
code improved, thanks! #Onyambu
If I understand correctly, the OP wants to create 3 new files each from the file names given as character vectors in each list element.
The main issue with OP's code is that newFile is overwritten in each iteration of the nested loops.
Here is what I would with my preferred tools (untested):
library(data.table) # for fread() and rbindlist()
library(magrittr) # use piping for clarity
lapply(
lc,
function(x) {
filenames <- file.path(dataFnsDir, x)
lapply(filenames, fread) %>%
rbindlist()
}
)
This will return a list of three dataframes (data.tables).
I do not have the OP's input files available but we can simulate the effect for demonstration. If we remove the second call to lapply() we will get a list of 3 elements each containing a character vector of file names with the path prepended.
lapply(
lc,
function(x) {
filenames <- file.path(dataFnsDir, x)
print(filenames)
}
)
[[1]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[4] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
[[2]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[[3]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
Data
dataFnsDir <-"/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA"
lc <- list(
c("150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt",
"JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
),
c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" ,
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
),
c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
)
)

R output tsv without scientific notation

I have this variable:
> output
[[1]]
[1] 1.394082e+19 3.481687e+18 1.829848e+19 1.414608e+19 1.694183e+19 1.394082e+19
[[2]]
[1] 1.569580e+19 1.569580e+19 1.204701e+19 1.600159e+19 6.915915e+18 4.672586e+18 1.095256e+19 1.395906e+19 2.199774e+18
[[3]]
[1] 1.602384e+19 2.610937e+18 1.750534e+19 3.749841e+17 1.602384e+19 1.921356e+18 1.490877e+19 1.858905e+17 9.592238e+18
[[4]]
[1] 1.488400e+19 8.239013e+18 1.397958e+19 1.488400e+19 5.659786e+17 1.235961e+18 1.802728e+19
[[5]]
[1] 1.038415e+19 5.060804e+18 3.892644e+18 1.038415e+19
I want to print to a tsv file without scientific notation. Something like
1.56958e+19 1.56958e+19 1.204701e+19 1.600159e+19 6.915915e+18
4.672586e+18 1.095256e+19 1.395906e+19 2.199774e+18
1.602384e+19 2.610937e+18 1.750534e+19 3.749841e+17 1.602384e+19
1.921356e+18 1.490877e+19 1.858905e+17 9.592238e+18
1.4884e+19 8.239013e+18 1.397958e+19 1.4884e+19 5.659786e+17
but without scientific notation. If I try,
x <- c(1:length(output)))
for ( val in x ) {
write(format(output[[val]],scientific = FALSE),file="test2.txt", append=TRUE,sep="\t")
}
I get new lines for every number:
13940817034235316224
3481686930273376256
18298477684861163520
14146076755830216704
16941832990896044032
13940817034235316224
15695800775901642752
15695800775901642752
Ive tried a few variations and having little experience with R leads me to believe I'm missing something simple?
You need to set up the ncolumns parameter explicitly. According to ?write, the default for ncolumns is:
ncolumns = if(is.character(x)) 1 else 5
Since format converts the numeric vector to a character vector, you always have one column per line; Programmatically you can set the ncolumns equal to the length of vector for each sublist:
for(vec in output) {
write(format(vec, scientific=FALSE), file="test2.txt", append=TRUE, sep="\t", ncolumns=length(vec))
}

Nested List Parsing with jsonlite

This is the second time that I have faced this recently, so I wanted to reach out to see if there is a better way to parse dataframes returned from jsonlite when one of elements is an array stored as a column in the dataframe as a list.
I know that this part of the power with jsonlite, but I am not sure how to work with this nested structure. In the end, I suppose that I can write my own custom parsing, but given that I am almost there, I wanted to see how to work with this data.
For example:
## options
options(stringsAsFactors=F)
## packages
library(httr)
library(jsonlite)
## setup
gameid="2015020759"
SEASON = '20152016'
BASE = "http://live.nhl.com/GameData/"
URL = paste0(BASE, SEASON, "/", gameid, "/PlayByPlay.json")
## get the data
x <- GET(URL)
## parse
api_response <- content(x, as="text")
api_response <- jsonlite::fromJSON(api_response, flatten=TRUE)
## get the data of interest
pbp <- api_response$data$game$plays$play
colnames(pbp)
And exploring what comes back:
> class(pbp$aoi)
[1] "list"
> class(pbp$desc)
[1] "character"
> class(pbp$xcoord)
[1] "integer"
From above, the column pbp$aoi is a list. Here are a few entries:
> head(pbp$aoi)
[[1]]
[1] 8465009 8470638 8471695 8473419 8475792 8475902
[[2]]
[1] 8470626 8471276 8471695 8476525 8476792 8477956
[[3]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[4]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[5]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[6]]
[1] 8469619 8471695 8473492 8474625 8475727 8475902
I don't really care if I parse these lists in the same dataframe, but what do I have for options to parse out the data?
I would prefer to take the data out of out lists and parse them into a dataframe that can be "related" to the original record it came from.
Thanks in advance for your help.
From #hrbmstr above, I was able to get what I wanted using unnest.
select(pbp, eventid, aoi) %>% unnest() %>% head

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is to run the command above for all the five variables at once. Instead of using just "ctl.list" as argument in the command above, I'd like to use all variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
You can put your operation into a function and then iterate over it:
get_my_substr <- function(vecname)
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
lapply(my_vecnames,get_my_substr)
lapply acts like a loop. You can create your list of vector names with
my_vecnames <- ls(pattern=".list$")
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function()
replicate(
sample(10,1),
paste0(prestr,paste0(sample(LETTERS,sample(5,1)),collapse=""),posstr)
)
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the variable names that you want extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}

Splitting string in r and putting them in new columns in new data frame

I want to do this in R.
I run a code to extract paths of all folders and subfolders. I get the list mentioned below. I want to apply a set of rules to this:
If 1 "/" is encountered in the whole line then replace that "/" with "/Folder/"
If 2 "/" is encountered in the whole line then do nothing.
If 3 or MORE "/" are encountered then ignore the first and last "/" and replace all remaining "/" with "-"
The code I run is to extract file path is:
b<-list.files(path="/Users/Mohit/Desktop/Company/Database",recursive=TRUE)
[1] "Accounts/Academic History.pdf" "Accounts/Contract.pdf"
[3] "Accounts/Credit/Analyst/Banking/TFileOutput.txt" "Accounts/Credit/Analyst/untitled.jpg"
[5] "Accounts/Credit/background.jpg" "Accounts/Credit/background.xcf"
[7] "Accounts/Debit/index.html" "Human Resources/RStudio-0.98.1073.dmg"
[9] "Information Technology/Iti.pdf" "Logistics/1610085_10152585224658626_398303669_n.jpg"
[11] "Sales/947309_10152376144413626_1056138683_n.jpg"
Im not able to understand which function to use. stringr package with sapply maybe?
I want to put this in a column with a heading and export it as text file.
Any help will be greatly appreciated.
Thanks very much
May be this helps:
library(stringr)
Ct <- str_count(b, "/")
b1 <- ifelse(Ct==1, gsub("[/]", "/Folder/", b),
ifelse(Ct>=3, gsub("(^([^/]+[/])|([/][^/]+)$)(*SKIP)(*F)|[/]", "-",b,
perl=TRUE), b))
b1
#[1] "Accounts/Folder/Academic History.pdf"
#[2] "Accounts/Folder/Contract.pdf"
#[3] "Accounts/Credit-Analyst-Banking/TFileOutput.txt"
#[4] "Accounts/Credit-Analyst/untitled.jpg"
#[5] "Accounts/Credit/background.jpg"
#[6] "Accounts/Credit/background.xcf"
#[7] "Accounts/Debit/index.html"
#[8] "Human Resources/Folder/RStudio-0.98.1073.dmg"
#[9] "Information Technology/Folder/Iti.pdf"
#[10] "Logistics/Folder/1610085_10152585224658626_398303669_n.jpg"
#[11] "Sales/Folder/947309_10152376144413626_1056138683_n.jpg"
If you want to create a data.frame and then export as .txt file
dat <- data.frame(b, b1, stringsAsFactors=FALSE)
write.table(dat, file="Mohit.txt", quote=FALSE, row.names=FALSE, sep=",")
Update
If you need to create 3 columns based on b1
datN <- setNames(read.table(text=b1, sep="/"), c("Class", "Title", "File"))
head(datN,2)
# Class Title File
#1 Accounts Folder Academic History.pdf
#2 Accounts Folder Contract.pdf
Now, you can save the file using write.table
data
b <- c("Accounts/Academic History.pdf", "Accounts/Contract.pdf", "Accounts/Credit/Analyst/Banking/TFileOutput.txt",
"Accounts/Credit/Analyst/untitled.jpg", "Accounts/Credit/background.jpg",
"Accounts/Credit/background.xcf", "Accounts/Debit/index.html",
"Human Resources/RStudio-0.98.1073.dmg", "Information Technology/Iti.pdf",
"Logistics/1610085_10152585224658626_398303669_n.jpg","Sales/947309_10152376144413626_1056138683_n.jpg"

Resources