Processing data from text file in R


I have been trying to read a text file (each line is a chat log) into R so that I can turn it into a data frame and tidy the data further.
I am using readLines so that each log is read as a single line. Because readLines reads each log as a single long character value, I then convert them to strings (I need to parse the log).
I loaded the text file into a data frame as shown below.
rawchat <- readLines("disc-W-App-avec-loy.txt")
rawchat <- as.data.frame(rawchat, stringsAsFactors=FALSE)
names(rawchat) <- "chat"
I am currently trying to identify every row (of 42000) that starts with the number 1. I can't seem to apply startsWith() correctly, nor dplyr's starts_with(), nor even grepl() with regular expressions.
Could it be the format of the observations in the data frame (chr)?
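A chr column is the right type for these functions, so the format is probably not the obstacle. A minimal sketch of the row test follows, assuming the data frame built above; the sample chat lines are invented for illustration. Note that startsWith() and grepl() need the character column rather than the whole data frame, and that dplyr's starts_with() selects columns by name, so it does not test row values.

```r
# Toy stand-in for the real file: the sample lines are invented for illustration.
rawchat <- data.frame(
  chat = c("1 hello there", "2 second log", "10 starts with one too", " 1 leading space"),
  stringsAsFactors = FALSE
)

# startsWith() expects a character vector, so pass the column, not the data frame.
hits_sw <- startsWith(rawchat$chat, "1")

# Equivalent test with a regular expression anchored at the start of the string.
hits_re <- grepl("^1", rawchat$chat)

# Variant that tolerates leading whitespace, a common reason for missed matches.
hits_ws <- grepl("^[[:space:]]*1", rawchat$chat)

# Keep only the matching rows, preserving the data frame structure.
rows_starting_with_1 <- rawchat[hits_sw, , drop = FALSE]
```

If the match should mean "the number 1" and not "any number beginning with 1" (such as 10), a stricter pattern like "^1[^0-9]" may be closer to the intent.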

Related

Writing For Loop or Split function to separate data from Master data frame into smaller data frames

I am once again asking for your help and guidance! Super duper novice here so I apologize in advance for not explaining things properly or my general lack of knowledge for something that feels like it should be easy to do.
I have sets of compounds in one "master" list that need to be separated into smaller lists. I want to be able to do this with a "for loop" or some other iterative function so that I am not changing the numbers by hand for each list. I want to separate the compounds based on the column "Run.Number" (there are 21 Run.Numbers).
Step 1: Load the programs needed and open File containing "Master List"
# tMSMS List separation
#Load library packages
library(ggplot2)
library(reshape)
library(readr) #loading the csv's
library(dplyr) #data manipulation
library(magrittr) #forward pipe
library(openxlsx) #open excel sheets
library(Rcpp) #got this from an error code while trying to open excel sheets
#STEP 1: open file
S1_MasterList<- read.xlsx("/Users/owner/Documents/Research/Yurok/Bioassay/Bioassay Data/220410_tMSMS_neg_R.xlsx")
Step 2: Currently, to go through each list, I have to change the "i" value for each iteration. I also have to change the name manually (Ctrl+F), replacing "S2_Export_1" with "S2_Export_2" and so on as I move from list to list. Also, when making the smaller lists, a handful of columns containing data need to be removed from the "Master List". The column names are formatted this way so that they are compatible with the LC-MS software. Each list is saved as a .csv file, again for compatibility with the LC-MS software.
#STEP 2: Iterative
#Replace: S2_Export_1
i <- 1
# Keep only the rows belonging to run i
(S2_Separate <- S1_MasterList[which(S1_MasterList$Run.Number == i), ])
# Keep only the columns needed for export
(S2_Export_1 <- data.frame(S2_Separate$On,
                           S2_Separate$`Prec..m/z`,
                           S2_Separate$Z,
                           S2_Separate$`Ret..Time.(min)`,
                           S2_Separate$`Delta.Ret..Time.(min)`,
                           S2_Separate$Iso..Width,
                           S2_Separate$Collision.Energy))
# Rename the columns to the format expected by the LC-MS software
(colnames(S2_Export_1) <- c("On", "Prec. m/z", "Z", "Ret. Time (min)", "Delta Ret. Time (min)", "Iso. Width", "Collision Energy"))
(write.csv(S2_Export_1, "/Users/owner/Documents/Research/Yurok/Bioassay/Bioassay Data/Runs/220425_neg_S2_Export_1.csv", row.names = FALSE))
Results: For this one particular data frame called "Master List", there should be 21 smaller data frames. I also want the data frames to be named S2_Export_1, S2_Export_2, S2_Export_3, S2_Export_4, etc.

First, select only the required columns (consider processing/renaming non-syntactic names first to avoid extra work downstream):
s1_sub <- select(S1_MasterList, Run.Number, On, `Prec..m/z`, Z,
                 `Ret..Time.(min)`, `Delta.Ret..Time.(min)`,
                 Iso..Width, Collision.Energy)
Then split s1_sub into a list of data frames with split():
s1_split <- split(s1_sub, s1_sub$Run.Number)
Finally, name the resulting list of data frames with setNames():
s1_split <- setNames(s1_split, paste0("S2_Export_", seq_along(s1_split)))
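The remaining step from the question, writing one .csv per run, could then be a short loop over the list. This is only a sketch under stated assumptions: the toy data frame, its reduced column set, and the use of tempdir() as the output folder are invented here, while the file name pattern follows the one in the question.

```r
# Toy stand-in for S1_MasterList; the values and the reduced column set are invented.
S1_MasterList <- data.frame(
  Run.Number = c(1, 1, 2, 3),
  On = TRUE,
  Z = 1,
  Collision.Energy = c(10, 20, 30, 40)
)

# Split on Run.Number, then drop the grouping column from each piece.
s1_split <- split(S1_MasterList, S1_MasterList$Run.Number)
s1_split <- lapply(s1_split, function(d) d[, c("On", "Z", "Collision.Energy")])

# Name each piece after its run number rather than its position in the list.
names(s1_split) <- paste0("S2_Export_", names(s1_split))

# Write one CSV per run; tempdir() is a placeholder for the real output folder.
out_dir <- tempdir()
for (nm in names(s1_split)) {
  out <- s1_split[[nm]]
  colnames(out) <- c("On", "Z", "Collision Energy")  # export-style column names
  write.csv(out, file.path(out_dir, paste0("220425_neg_", nm, ".csv")), row.names = FALSE)
}
```

Naming from names(s1_split) instead of seq_along() keeps each file tied to its actual run number even if some run numbers are missing; with runs 1 to 21 all present, the two give the same names.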

Looping through dataframe names in R and saving out corresponding dataframe as Rds file

I have about 30 separate dataframes loaded in my R session, each with a different name. I also have a character vector called mydfs which contains the names of all those dataframes. I am trying to loop over mydfs and save each dataframe named in it as an rds file, but for some reason I'm only able to save the character string holding the name of the dataframe (not the dataframe itself). Here is a simulated, reproducible example of what I have:
#Create vector of dataframes that exist in base r to create a reproducible example
mydfs<-c("cars","iris","iris3","mtcars")
#My code that creates files, but they don't contain my dataframe data for some reason
for (i in 1:length(mydfs)){
  savefile<-paste0(paste0("D:/Data/", mydfs[i]), ".Rds")
  saveRDS(mydfs[i], file=savefile)
  print(paste("Dataframe Saved:", mydfs[i]))
}
This results in the following log output:
[1] "Dataframe Saved: cars"
[1] "Dataframe Saved: iris"
[1] "Dataframe Saved: iris3"
[1] "Dataframe Saved: mtcars"
Then I try to read back in any of the files I created:
#But when read back in only contain a single character string of the dataframe name
a<-readRDS("D:/Data/iris3.Rds")
str(a)
chr "iris3"
Note that when I read iris3.Rds back into a new R session using readRDS, I don't get a dataframe as I was expecting, but a single character vector containing the name of the dataframe and not the data.
I haven't been programming in R for a while, since my current client preferred SAS, so I think I am somehow getting macro variable looping in SAS confused with R and so that when I call saveRDS, I'm passing in a single character vector instead of the actual dataframe. How can I get the dataframe to be passed into saveRDS instead of the character?
Thanks for helping me untangle my SAS thinking with my somewhat rusty R thinking.

You're currently just saving the names of the dataframes. You can use the get() function to look up each object by its name, as follows:
mydfs<-c("cars","iris","iris3","mtcars")
for (i in 1:length(mydfs)){
  savefile<-paste0(paste0("D:/Data/", mydfs[i]), ".Rds")
  saveRDS(get(mydfs[i]), file=savefile)
  print(paste("Dataframe Saved:", mydfs[i]))
}
readRDS('D:/Data/iris3.Rds')
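A variant of the same idea is to fetch all objects at once with mget(), which returns a named list. The sketch below assumes the same four built-in datasets; tempdir() replaces "D:/Data/" only so that the example runs on any machine.

```r
# mget() looks up several objects by name and returns them as a named list.
mydfs <- c("cars", "iris", "iris3", "mtcars")
df_list <- mget(mydfs, inherits = TRUE)

# tempdir() stands in for "D:/Data/" so the sketch runs anywhere.
out_dir <- tempdir()
for (nm in names(df_list)) {
  saveRDS(df_list[[nm]], file = file.path(out_dir, paste0(nm, ".Rds")))
}

# Round trip: the object read back is the data, not its name.
a <- readRDS(file.path(out_dir, "iris3.Rds"))
```

Keeping the objects in a named list also avoids repeated get() calls if more per-object processing is added later.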

R: How to read in a SAS dataset with all columns as character

I'm using R to tidy data supplied to me (in a SAS file) so that I can bulk insert it into a SQL Server database. The problem is that numeric fields sometimes get transformed by R after I read them in (e.g. the leading 0 gets dropped, some numeric fields are converted to scientific notation, long ID numbers turn into gibberish after the 15th digit).
Reading all the data into R as character solves these issues. When I'm supplied a csv file, I can use data.table's fread function and specify colClasses = 'character'; however, as far as I'm aware, nothing like this exists for the read_sas function from the haven package.
Are there any workarounds or extra documentation on how I can better approach and solve this issue?
Edit to highlight issues (left values is numeric and what I want to avoid, right value is as character and what I want):
1.
postcode <- c(0629,'0629')
postcode
[1] "629" "0629"
2.
id <- c(12000000,'12000000')
id
[1] "1.2e+07" "12000000"
3.
options(scipen=999)
id <- c(123123123123123123123123,'123123123123123123123123')
id
[1] "123123123123123117883392" "123123123123123123123123"
How can I import the data directly from SAS so that all columns in the data frame are read in as character data type (in order to avoid data quality issues when I insert into SQL Server)?
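One possible workaround, sketched here in base R, is to convert the columns to character after the import. This is an assumption-laden illustration, not a read_sas feature: the data frame stands in for the imported data, num_to_chr is a hypothetical helper, and the 4-digit postcode width is assumed. It fixes scientific notation and known-width leading zeros, but it cannot restore digits of long IDs that were already lost when the values were stored as floating-point numbers.

```r
# Stand-in for the data frame returned by the SAS import; values are invented.
dat <- data.frame(
  postcode = c(629, 2000),
  id = c(12000000, 345),
  name = c("a", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: render numbers as text without scientific notation.
num_to_chr <- function(x) {
  if (is.numeric(x)) format(x, scientific = FALSE, trim = TRUE) else as.character(x)
}

# Apply the helper to every column so that all columns end up as character.
dat_chr <- as.data.frame(lapply(dat, num_to_chr), stringsAsFactors = FALSE)

# Restore leading zeros where the width is known (4 digits is an assumption here).
dat_chr$postcode <- sprintf("%04d", as.integer(dat$postcode))
```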

read.csv() R x must be numeric

I am trying to read data from a csv file.
The data consists of small integers (53, 98, ...).
The csv was made with OpenOffice; the data is in the first column, one number per row.
Reading the data was simple (no problem at all):
BirthNumbers <- read.csv("/Users/.../RawData.csv", header=FALSE)
Now I try to calculate mean(BirthNumbers) (for example), but it is not possible. The error message is:
x is not numeric
Where is my mistake?
Thanks for all help
Norbert

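It's probably being read in as characters.
Try mean(as.numeric(BirthNumbers[[1]])), which extracts the first column before converting it (as.numeric() cannot convert a whole data frame).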

As per https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html (see Value section), read.csv returns a data frame.
You should be calling mean on the column of the data frame. Since you have no headers (given your header = FALSE), most likely the column is called V1 (verify by doing head(BirthNumbers) or colnames(BirthNumbers)), so you should do mean(BirthNumbers$V1).
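A small reproducible sketch of this point follows; the inline text is an invented stand-in for RawData.csv, with one small integer per row and no header.

```r
# Inline text stands in for RawData.csv: one small integer per row, no header.
BirthNumbers <- read.csv(text = "53\n98\n61", header = FALSE)

class(BirthNumbers)      # a data frame, which mean() cannot average directly
colnames(BirthNumbers)   # the single unnamed column is called V1

# Take the mean of the column, not of the data frame.
avg <- mean(BirthNumbers$V1)
```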

Using R to create and merge zoo object time series from csv files

I have a large set of csv files in a single directory. These files contain two columns, Date and Price. The filename of each filename.csv contains the unique identifier of the data series. I understand that missing values for merged data series can be handled when these time series are zoo objects. I also understand that, by using na.locf(merge()), I can fill in the missing values with the most recent observations.
I want to automate the process of:
1. loading the Date and Price columns of each *.csv file into R dataframes.
2. giving each distinct time series within the merged zoo "portfolio of time series" object an identity that is equal to its <filename>.
3. merging these zoo time series using MergedData <- na.locf(merge( )).
The ultimate goal, of course, is to use the fPortfolio package.
I've used the following statement to create a data frame of Date,Price pairs. The problem with this approach is that I lose the <filename> identifier of the time series data from the files.
result <- lapply(files, function(x) x <- read.csv(x) )
I understand that I can write code to generate the R statements required to do all these steps instance by instance. I'm wondering if there is some approach that wouldn't require me to do that. It's hard for me to believe that others haven't wanted to perform this same task.

Try this:
z <- read.zoo(files, header = TRUE, sep = ",")
z <- na.locf(z)
I have assumed a header line and lines like 2000-01-31,23.40 . Use whatever read.zoo arguments are necessary to accommodate whatever format you have.

You can get better formatting using sapply() (it keeps the file names). Here I will keep lapply().
Assuming that all your files are in the same directory, you can use list.files(), which is very handy for this kind of workflow.
I would use read.zoo() to get zoo objects directly (this avoids coercing later).
For example:
zoo.objs <- lapply(list.files(path = MY_FILES_DIRECTORY,
                              pattern = '^zoo_.*\\.csv$', ## csv files whose names start with zoo_
                              full.names = TRUE),          ## full names: path + filename
                   read.zoo)
Then I use list.files() again to name the result:
names(zoo.objs) <- list.files(path = MY_FILES_DIRECTORY,
                              pattern = '^zoo_.*\\.csv$')
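The naming step can also be done in one pass, so that the file identifier is never lost. The sketch below uses only base R and plain data frames instead of zoo objects; the two CSV files, their names (AAA, BBB), and their values are invented for illustration.

```r
# Create two small CSV files in a temporary folder; names and values are invented.
csv_dir <- file.path(tempdir(), "series_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(Date = c("2000-01-31", "2000-02-29"), Price = c(23.4, 24.1)),
          file.path(csv_dir, "AAA.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2000-01-31", "2000-03-31"), Price = c(10.0, 11.5)),
          file.path(csv_dir, "BBB.csv"), row.names = FALSE)

files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Name each list element after its file (without folder or extension).
ids <- tools::file_path_sans_ext(basename(files))
result <- setNames(lapply(files, read.csv), ids)

# Rename each Price column to its identifier, then merge on Date keeping all dates.
renamed <- Map(function(d, id) setNames(d, c("Date", id)), result, ids)
merged <- Reduce(function(a, b) merge(a, b, by = "Date", all = TRUE), renamed)
```

The merged table has one column per file and NA where a series has no observation on a date, which is the gap that na.locf() would fill once the data is converted to a zoo object.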
