Complex formatting using R

I am relatively new to R and would like to use it to reformat a large amount of data held in one file (30,000+ unique IDs) and write it in a different format to another file.
I am stuck on some R behaviour that I cannot work out. Because the case is complex, I am sharing a Dropbox folder with the example files. This is what I am trying to achieve:
I would like the output formatted like "Final Product.dat", which is space delimited (it must keep exactly the same spacing as shown there, otherwise the programs I run on it will not work!).
The first line of the file holds an ID of the form "HUXXX...1", which runs from 1 to 30,000+; the next 5 lines are always the same, and then the data follow.
Lines 2 to 4 are stored in a separate file (SHead), and the whole of line 5 is stored in "header.txt".
For the first line I created another file, ID-s100, which contains the "HUXXXX" value alongside the ID column. I was hoping to use some sort of "sprintf" for that (see the code in the Dropbox folder).
The data are in "S100.txt"; note that I have created an ID as the first column.
My code sort of works, but it repeats the same output for every row of an ID. For example, for ID = 1, instead of writing one HU0000001 line and one profile (as in "Final Product.dat"), it writes HU00001 and the same data 12 times, and then does the same for ID 2.
What I want is exactly what "Final Product.dat" shows: ONE unique HUXXXXXN line and, for each HUXXXXN, its corresponding data below it (matched by the ID column).
Can someone help me? I am finding it hard to get the code working from what I can find on the Internet, and I think I need some insight.
https://www.dropbox.com/sh/5gxpw0dy2brahho/AADirGyvMUjkTDPbO4fJmlLBa?dl=0
Since I understand the need to have the code and the data written here, I have reproduced them as best I can.
I hope the "Final Product" displays as it should:
The bold part of the first line is what I would like to change for each ID:
*HU00000001 AAAA C 150 SOME DEFAULT
So I have a file called ID-s100.txt, which has the ID in column 1 and the HU numbers in column 2.
Then these lines are the same for each ID:
# SITE COUNTRY LAT LON SC FL
LCO ................................................. (Some info go here)
# SCOM SALB SLU1 SLDR SLRO SLNF SLPF SMHB SMPX SMKE
.............................. (some other info goes here)
The four lines above are read from the SHead.txt file.
Then header.txt contains the following:
# SLB SLMH SLLL SDUL SSAT SRGF SSKS SBDM SLOC SLCL SLSI SLCF SLNI SLHW SLHB SCEC SADC
The data that go below it are stored in "s100.csv" and "ksat and srfg.txt", with the ID in the first column and the data afterwards. The "ksat..." file is just one set of data repeated until the end. Below is an extract of the data for the first IDs (ID in the first column):
1 10 0.02 0.04 0.12 0.42 4 8 1.4 2.02 6.18
1 20 0.02 0.04 0.12 0.42 4 8 1.4 2.02 6.18
1 30 0.02 0.04 0.12 0.42 4 8 1.4 2.02 6.18
1 40 0.02 0.04 0.12 0.42 4 8 1.4 0.5 5.5
2 5 0.02 0.04 0.12 0.42 3 8 1.4 2.4 5.5
2 10 0.02 0.04 0.12 0.42 3 8 1.4 2.4 5.5
2 20 0.02 0.04 0.12 0.42 3 8 1.4 2.4 5.5
...then it goes until +30,000
and here is the code:
library(MASS)
setwd("C:\\....")
s100 = read.csv("C:\\.....\\S100.csv", header=F)
SHead = readLines("SHead.txt")
sID.n= read.table("C:\\.....\\ID-s100.txt",header=F)
sheader= read.table("C:\\.....\\header.txt", header=F)
ksat = read.table("C:\\.....\\Ksat and Srfg.txt", header=F)
m99 <- rep("-99", nrow(s100))
m9df = as.data.frame(m99)
SSAT.n = cbind(s100$V1, s100$V12, s100$V2, m9df, s100$V4, s100$V5, s100$V6, ksat$V1, ksat$V2,
s100$V9, s100$V10, s100$V7, s100$V8, m9df, m9df, s100$V11, m9df, m9df, m9df)
for(id in s100$V1){
SSAT = SSAT.n[s100$V1 == id, 2:19]
shead = as.matrix(sheader, dimnames = NULL)
SSAT[is.na(SSAT)] <- " "
xxm = as.matrix(SSAT, dimnames = NULL)
names(xxm) <- names(shead)
t1 = rbind(shead,xxm)
t2 = as.data.frame(t1, dimnames = NULL)
names(t2) = c(" ", " "," "," ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ")
t3 = format(t2, justify = "right", col.names=F, row.names=F, dimnames=NULL)
sID = sID.n[sID.n$V1 == id, 2]
Line2 = paste(sprintf("%1s%1s %5s", "*", sID, " UPSC SCL 140 SOIL PROFILE\n", "\n", sep=""))
file.n = "TRIAL.dat"
con = file(description = paste(file.n), open="a")
writeLines(Line2, con=con, sep= "\n")
writeLines(SHead, con = con , sep = "\n" )
write.table(t3, file= con, append = TRUE, quote=F, col.names=FALSE, row.names= F)
close(con)
}
Thank you in advance for any help!
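A minimal sketch of one likely fix, assuming the repetition comes from `for(id in s100$V1)`: that loop visits every row, so an ID that has 12 rows is written 12 times. Looping over `unique(s100$V1)` writes each block once. The toy data, the two fixed lines and the `sprintf` field widths below are illustrative assumptions, not the real layout of "Final Product.dat".

```r
# Toy stand-in for S100.csv: column 1 is the ID, the rest are data (assumed layout).
s100 <- data.frame(V1 = c(1, 1, 1, 2, 2),
                   V2 = c(10, 20, 30, 5, 10),
                   V3 = c(0.02, 0.02, 0.02, 0.02, 0.02))

# Hypothetical fixed lines; in the question they come from SHead.txt and header.txt.
fixed_lines <- c("# SITE COUNTRY LAT LON SC FL", "# SLB SLMH SLLL")

out_file <- tempfile(fileext = ".dat")
con <- file(out_file, open = "w")      # open once, outside the loop

for (id in unique(s100$V1)) {          # one pass per ID, not one pass per row
  block <- s100[s100$V1 == id, -1]     # all rows belonging to this ID
  # "%08d" zero-pads the ID to 8 digits, e.g. 1 -> 00000001
  id_line <- sprintf("*HU%08d AAAA C 150 SOME DEFAULT", as.integer(id))
  # fixed-width fields keep the column spacing identical on every line
  data_lines <- sprintf("%6d %5.2f", as.integer(block$V2), block$V3)
  writeLines(c(id_line, fixed_lines, data_lines), con)
}
close(con)

result <- readLines(out_file)
```

Building each data line with `sprintf` rather than `write.table` gives direct control over the width of every field, which matters when the downstream program depends on exact spacing.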

Related

How do I wrangle messy, raw data and import into R?

I have raw, messy data for time series containing around 1400 observations. Here is a snippet of what it looks like:
[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null] ... etc
I want to pull the date and its respective value to form a tsibble in R. So, from the above values, it would be like
Date        y-variable
2021-08-24  1.67
2021-08-23  1.65
2021-08-22  1.62
Notice how only the first value is to be paired with its respective date - I don't need the other values. Right now, the raw data has been copied and pasted into a word document and I am unsure about how to approach data wrangling to import into R.
How could I achieve this?
#replace the text connection with a file connection if desired; the file should then be a .txt
input <- readLines(textConnection("[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"))
#insert line breaks
input <- gsub("],[", "\n", input, fixed = TRUE)
#remove "new Date"
input <- gsub("new Date", "", input, fixed = TRUE)
#remove parentheses and brackets
input <- gsub("[\\(\\)\\[\\]]", "", input, perl = TRUE)
#import cleaned data
DF <- read.csv(text = input, header = FALSE, quote = "'")
DF$V1 <- as.Date(DF$V1)
print(DF)
# V1 V2 V3 V4 V5
#1 2021-08-24 1.67 1.68 0.9 null
#2 2021-08-23 1.65 1.68 0.9 null
#3 2021-08-22 1.62 1.68 0.9 null
How is this?
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"
df <- read.table(text = unlist(strsplit(gsub('new Date\\(|\\)', '', gsub('^.(.*).$', '\\1', text)), "].\\[")), sep = ",")
> df
V1 V2 V3 V4 V5
1 2021-08-24 1.67 1.68 0.9 null
2 2021-08-23 1.65 1.68 0.9 null
3 2021-08-22 1.62 1.68 0.9 null
Changing column names and removing the last columns is trivial from this point
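For completeness, here is a base R sketch that goes straight to the two wanted columns. The names `ts_df`, `Date` and `y` are my own choices, and the sketch assumes every record has exactly the `[new Date('...'),value,...]` shape shown in the question.

```r
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"

# drop the outer brackets, then split into one string per record
rows <- strsplit(gsub("^\\[|\\]$", "", text), "],[", fixed = TRUE)[[1]]

# split each record on commas; field 1 is the date, field 2 the first value
fields <- strsplit(rows, ",", fixed = TRUE)
first  <- vapply(fields, `[`, character(1), 1)
second <- vapply(fields, `[`, character(1), 2)

# strip the "new Date('" prefix and "')" suffix from the date field
dates <- gsub("new Date\\('|'\\)", "", first)

ts_df <- data.frame(Date = as.Date(dates), y = as.numeric(second))
```

A data frame like `ts_df` could then be handed to `tsibble::as_tsibble()` with `index = Date`, if a tsibble is the end goal.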

Split data contained in one column into 3 columns in R

I have a dataset containing character vectors (that are really numbers) that I want to split into 3 different columns. These 3 columns need to hold the 3 numbers contained in the original column.
Data<-data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(Data)<- "values"
Data
values
1.50 (1.30 to 1.70)
1.30 (1.20 to 1.50)
The result I expect is this:
value1 value2 value3
1.50 1.30 1.70
1.30 1.20 1.50
One way of doing this is to use separate() from the tidyr package. From the documentation: Separate a character column into multiple columns with a regular expression or numeric locations
Adapting the example in the documentation to handle decimals, and using extra="drop" to drop the discarded pieces without warnings:
Data<-data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(Data)<- "values"
Data
require(tidyr)
separate(Data, col = values, into = paste0("value",1:3),
sep = "[^[:digit:]?\\.]+" , extra="drop")
#output
# value1 value2 value3
#1 1.50 1.30 1.70
#2 1.30 1.20 1.50
We can also use extract specifying the regex pattern to extract data.
tidyr::extract(Data, values, paste0("value",1:3),
regex = '(\\d+\\.\\d+)\\s\\((\\d+\\.\\d+)\\sto\\s(\\d+\\.\\d+)\\)')
# value1 value2 value3
#1 1.50 1.30 1.70
#2 1.30 1.20 1.50
(\\d+\\.\\d+) is used to extract a decimal value
\\s is whitespace.
We use capture groups to extract the value in three different columns.
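The same capture-group idea can be sketched in base R without tidyr, using `regexec()` and `regmatches()`. This is an illustration under the assumption that every string follows the "x (y to z)" shape exactly; the names `pattern`, `m` and `out` are mine.

```r
values <- c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)")

# three capture groups, one per decimal number
pattern <- "(\\d+\\.\\d+)\\s\\((\\d+\\.\\d+)\\sto\\s(\\d+\\.\\d+)\\)"

# regmatches() returns, per string, the full match followed by the groups
m <- regmatches(values, regexec(pattern, values, perl = TRUE))

# drop the full match (element 1) and convert the groups to numbers
out <- as.data.frame(do.call(rbind, lapply(m, function(x) as.numeric(x[-1]))))
names(out) <- paste0("value", 1:3)
```

Unlike the tidyr calls above, this returns numeric columns directly, because of the explicit `as.numeric()`.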
You can try this code:
library(easyr)
x = data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(x)[1] = "val"
x$val1 = left(x$val, 4)
x$val2 = mid(x$val, 7,4)
x$val3 = mid(x$val, 15,4)

R: save txt with tab as header of rows

I have a very large data frame with SNPs in rows (~50.000) and IDs in columns (~500), imagine an extraction would look something like this:
R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
Now I want to save this as a txt, which is normally no problem with
write.table(example, "example.txt", col.names=T, row.names=T, quote=F)
BUT I need to have a tab (\t) as the first entry of the header row, so in the txt file the data frame should look something like:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
(\t for the tab)
Can anyone help me how to do this?
Btw I also tried:
write.table(data.frame("\t"=rownames(example),example),"example.txt", row.names=FALSE)
It did not work, unfortunately...
Thanks!
This kind of works, just replace stdout() with the path to your output-file:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c(' ', names(data)), collapse = '\t'),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = '\t')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 35 97 27
#> B 12 69 24
#> C 25 9 34
Or with spaces as separators and the tab you wished for in the first column:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c('\t', names(data)), collapse = ' '),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = ' ')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 3 30 11
#> B 62 69 70
#> C 93 55 73
Using a data frame like the following, where I've changed one row name to illustrate how to deal with cases of unequal length:
df <- read.table(text = "R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58")
You could do something like this:
df <- format(as.matrix(df))
df <- cbind("\\t" = rownames(df), df)
df <- rbind(colnames(df), df)
df[,1] <- stringr::str_pad(df[,1], max(nchar(df[,1])), "right")
write.table(df,
file = "example.txt",
sep = " ",
quote = F,
row.names = F,
col.names = F)
Output:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58
I first converted the numeric values to character and formatted them to make sure they have the same number of digits, otherwise they won't line up. Then I turn the row names into a new variable named \\t, and then I turn the column names into a new row. I use stringr::str_pad() to account for row names of differing lengths. Finally, I write the data frame to TXT file without the row or column names.
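If tab-separated output throughout is acceptable, a shorter route may be `write.table()` with `col.names = NA`, which is documented to add a blank column name above the row names. With `sep = "\t"` and `quote = FALSE` the header line therefore starts with a tab. The sketch below rebuilds the question's `example` data frame and writes to a temporary file.

```r
example <- data.frame(R015 = c(0.81, 0.91),
                      R016 = c(0.90, 0.80),
                      R007 = c(0.87, 0.58),
                      row.names = c("cg158", "cg178"))

out_file <- tempfile(fileext = ".txt")

# col.names = NA (with row.names = TRUE) puts an empty header cell above
# the row names, so the first line begins with the separator, here a tab
write.table(example, out_file, sep = "\t", quote = FALSE,
            col.names = NA, row.names = TRUE)

written <- readLines(out_file)
```

Note that this separates every column with a tab, so it does not reproduce the space-padded layout of the answer above.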

strsplit() output as a dataframe in r

I have some results from a model in Python which i have saved as a .txt to render in RMarkdown.
The .txt is this.
precision recall f1-score support
0 0.71 0.83 0.77 1078
1 0.76 0.61 0.67 931
avg / total 0.73 0.73 0.72 2009
I read the file into r as,
x <- read.table(file = 'report.txt', fill = T, sep = '\n')
When i save this, r saves the results as one column (V1) instead of 5 columns as below,
V1
1 precision recall f1-score support
2 0 0.71 0.83 0.77 1078
3 1 0.76 0.61 0.67 931
4 avg / total 0.73 0.73 0.72 2009
I tried using strsplit() to split the columns, but it doesn't work.
strsplit(as.character(x$V1), split = "|", fixed = T)
Maybe strsplit() is not the right approach? How do I get around this so that I have a [4x5] data frame?
Thanks a lot.
Not very elegant, but this works. First we read the raw text, then we use regex to clean up, delete white space, and convert to csv readable format. Then we read the csv.
library(stringr)
library(magrittr)
library(purrr)
text <- str_replace_all(readLines("~/Desktop/test.txt"), "\\s(?=/)|(?<=/)\\s", "") %>%
.[which(nchar(.)>0)] %>%
str_split(pattern = "\\s+") %>%
map(., ~paste(.x, collapse = ",")) %>%
unlist
read.csv(textConnection(text))
#> precision recall f1.score support
#> 0 0.71 0.83 0.77 1078
#> 1 0.76 0.61 0.67 931
#> avg/total 0.73 0.73 0.72 2009
Created on 2018-09-20 by the reprex package (v0.2.0).
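A base R variant of the same idea, as a sketch: join "avg / total" into a single token, drop blank lines, and let `read.table()` do the splitting on whitespace. The `report` vector below stands in for `readLines("report.txt")`, and its exact spacing is an assumption.

```r
# stand-in for readLines("report.txt"); spacing is assumed
report <- c("             precision    recall  f1-score   support",
            "",
            "          0       0.71      0.83      0.77      1078",
            "          1       0.76      0.61      0.67       931",
            "",
            "avg / total       0.73      0.73      0.72      2009")

# make the summary label one whitespace-free token
cleaned <- sub("avg / total", "avg/total", report, fixed = TRUE)
cleaned <- cleaned[nzchar(trimws(cleaned))]   # drop empty lines

# the header has one field fewer than the rows, so read.table()
# uses the first column as row names
df <- read.table(text = cleaned, header = TRUE, check.names = FALSE)
```

Here `check.names = FALSE` keeps the column name `f1-score` as written instead of converting it to `f1.score`.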
Since it is much simpler to have Python output a CSV, I am posting an alternative here in case it is useful; even in Python it needs some work.
import pandas as pd

def report_to_csv(report, title):
    report_data = []
    lines = report.split('\n')
    # loop through the per-class lines
    for line in lines[2:-3]:
        row = {}
        row_data = line.split(' ')
        row['class'] = row_data[1]
        row['precision'] = float(row_data[2])
        row['recall'] = float(row_data[3])
        row['f1_score'] = float(row_data[4])
        row['support'] = float(row_data[5])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    # read the final summary line
    line_data = lines[-2].split(' ')
    summary_dat = []
    row2 = {}
    row2['class'] = line_data[0]
    row2['precision'] = float(line_data[1])
    row2['recall'] = float(line_data[2])
    row2['f1_score'] = float(line_data[3])
    row2['support'] = float(line_data[4])
    summary_dat.append(row2)
    summary_df = pd.DataFrame.from_dict(summary_dat)
    # concatenate both data frames
    report_final = pd.concat([df,summary_df], axis=0)
    report_final.to_csv(title+'cm_report.csv', index = False)
Function inspired from this solution

How to begin at a different starting point for a summation in R

I would like to read in data and do a summation over one of the columns, which has 18000 data points. The summation takes the variable Tc and subtracts its value from five iterations earlier. I don't know how to make the summation start 5 data points down, so that it does not give me an error because there is nothing to subtract for the first 4 data points.
Here is what a small portion of the data looks like:
head(data)
Time Record Ux Uy Uz Ts Tc Tn To Tp Tq
1 2016-09-07 09:00:00.1 38651948 0.46 1.21 -0.26 19.53 19.31726 20.43197 19.39093 19.54993 NAN
2 2016-09-07 09:00:00.2 38651949 0.53 1.24 -0.24 19.48 19.30391 20.43744 19.37996 19.51704 NAN
3 2016-09-07 09:00:00.3 38651950 0.53 1.24 -0.24 19.48 19.31249 20.43269 19.3752 19.44648 NAN
4 2016-09-07 09:00:00.4 38651951 0.53 1.24 -0.24 19.48 19.30391 20.40221 19.33919 19.41596 NAN
5 2016-09-07 09:00:00.5 38651952 0.53 1.24 -0.24 19.48 19.24906 20.36079 19.31178 19.38068 NAN
6 2016-09-07 09:00:00.6 38651953 0.51 1.28 -0.28 19.44 19.20519 20.32008 19.30629 19.42693 NAN
Here is the code:
data <- read.csv(('TOA5_10815.raw_data5411_2016_09_07_0900.dat'),
header = FALSE,
dec = ",",
col.names = c("Time", "Record", "Ux", "Uy", "Uz", "Ts", "Tc", "Tn", "To", "Tp", "Tq"),
skip = 4)
Tc = data$Tc
sum = 0
m = 18000
j = 5
for (k in 1:(m-j)){
inner = (Tc[[k]]-Tc[[k-j]])
sum = sum + inner
}
final = 1/(m-j)*sum
Welcome to stackoverflow!
I would suggest you make a more reproducible example for your next questions here.
To answer your question: you can either do this in a for loop, as you are currently doing, or in a more efficient way using one of the apply functions (here: lapply).
Creating data set:
set.seed(1)
Tc<-rnorm(18000)
The lapply function. Note that we are starting at 6, since Tc[5] - Tc[c(5-5)] would return an empty vector (numeric(0)): Tc[0] has length zero, so there is nothing to subtract.
sum<-unlist(lapply(6:18000,function(x) Tc[x]-Tc[c(x-5)]))
Done!
Verifying the function by typing in console:
> head(sum)
[1] -0.1940146 0.3037857 1.5739533 -1.0194995 -0.6348962 2.3322496
> Tc[6]-Tc[1]
[1] -0.1940146
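The same lagged differences can also be computed without any loop or apply call. A sketch using base R's `diff()` with `lag = 5`, on the same simulated `Tc`; `final` reproduces the question's `1/(m-j)*sum`, and the explicit loop is included only as a cross-check.

```r
set.seed(1)
Tc <- rnorm(18000)   # simulated data, as in the answer above

j <- 5
m <- length(Tc)

# diff(Tc, lag = j) returns Tc[k] - Tc[k - j] for k = j + 1, ..., m
lagged <- diff(Tc, lag = j)

# average of the m - j differences, i.e. 1/(m - j) * sum
final <- mean(lagged)

# explicit loop for comparison; it starts at j + 1 so that k - j >= 1
total <- 0
for (k in (j + 1):m) {
  total <- total + (Tc[k] - Tc[k - j])
}
loop_final <- total / (m - j)
```

The loop in the question fails because it starts at k = 1, where k - j is negative; starting at j + 1 (or using `diff()`) avoids the invalid index.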
