How do I wrangle messy, raw data and import it into R?

I have raw, messy data for time series containing around 1400 observations. Here is a snippet of what it looks like:
[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null] ... etc
I want to pull the date and its respective value to form a tsibble in R. So, from the above values, it would be like
Date y-variable
2021-08-24 1.67
2021-08-23 1.65
2021-08-22 1.62
Note that only the first value should be paired with its date; I don't need the other values. Right now, the raw data has been copied and pasted into a Word document, and I am unsure how to approach the data wrangling needed to import it into R.
How could I achieve this?

#replace the text connection with a file connection if desired; the file should then be a .txt file
input <- readLines(textConnection("[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"))
#insert line breaks
input <- gsub("],[", "\n", input, fixed = TRUE)
#remove "new Date"
input <- gsub("new Date", "", input, fixed = TRUE)
#remove parentheses and brackets
input <- gsub("[\\(\\)\\[\\]]", "", input, perl = TRUE)
#import cleaned data
DF <- read.csv(text = input, header = FALSE, quote = "'")
DF$V1 <- as.Date(DF$V1)
print(DF)
# V1 V2 V3 V4 V5
#1 2021-08-24 1.67 1.68 0.9 null
#2 2021-08-23 1.65 1.68 0.9 null
#3 2021-08-22 1.62 1.68 0.9 null

How is this?
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"
df <- read.table(text = unlist(strsplit(gsub('new Date\\(|\\)', '', gsub('^.(.*).$', '\\1', text)), "].\\[")), sep = ",")
> df
V1 V2 V3 V4 V5
1 2021-08-24 1.67 1.68 0.9 null
2 2021-08-23 1.65 1.68 0.9 null
3 2021-08-22 1.62 1.68 0.9 null
Changing the column names and removing the last columns is trivial from this point.
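For completeness, here is a minimal sketch of that last step: keeping only the date and the first value, then converting to a tsibble. The column names Date and y are my own choice, and the conversion assumes the tsibble package is installed, so it is guarded with requireNamespace().

```r
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"

# One string per record, then strip the "new Date('...')" wrapper and outer brackets
records <- strsplit(text, "],[", fixed = TRUE)[[1]]
records <- gsub("new Date\\('|'\\)|^\\[|\\]$", "", records)

# Parse as CSV; the literal "null" becomes NA
df <- read.csv(text = records, header = FALSE, na.strings = "null")

# Keep only the date and the first value, sorted by date
ts_df <- data.frame(Date = as.Date(df$V1), y = df$V2)
ts_df <- ts_df[order(ts_df$Date), ]

# Assumption: tsibble is installed; skipped otherwise
if (requireNamespace("tsibble", quietly = TRUE)) {
  ts_tbl <- tsibble::as_tsibble(ts_df, index = Date)
  print(ts_tbl)
}
```

For the real data, the text would come from readLines() on a plain-text file rather than from a Word document.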

Related

Split data contained in one column into 3 columns in R

I have a dataset containing character vectors (that are really numbers) that I want to split into 3 different columns. These 3 columns need to hold the 3 numbers contained in the original column.
Data<-data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(Data)<- "values"
Data
values
1.50 (1.30 to 1.70)
1.30 (1.20 to 1.50)
The result I expect is this:
value1 value2 value3
1.50 1.30 1.70
1.30 1.20 1.50
One way of doing this is to use separate() from the tidyr package. From the documentation: "Separate a character column into multiple columns with a regular expression or numeric locations."
Adapting from the example in the documentation, using a separator pattern that keeps decimals together, and using extra="drop" to drop the discarded pieces without warnings:
Data<-data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(Data)<- "values"
Data
require(tidyr)
separate(Data, col = values, into = paste0("value",1:3),
sep = "[^[:digit:]?\\.]+" , extra="drop")
#output
# value1 value2 value3
#1 1.50 1.30 1.70
#2 1.30 1.20 1.50
We can also use extract specifying the regex pattern to extract data.
tidyr::extract(Data, values, paste0("value",1:3),
regex = '(\\d+\\.\\d+)\\s\\((\\d+\\.\\d+)\\sto\\s(\\d+\\.\\d+)\\)')
# value1 value2 value3
#1 1.50 1.30 1.70
#2 1.30 1.20 1.50
(\\d+\\.\\d+) is used to extract a decimal value
\\s is whitespace.
We use capture groups to extract the value in three different columns.
You can try this code:
library(easyr)
x = data.frame(c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"))
colnames(x)[1] = "val"
x$val1 = left(x$val, 4)
x$val2 = mid(x$val, 7,4)
x$val3 = mid(x$val, 15,4)
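The fixed-position approach relies on every number being exactly four characters wide. A base R alternative, sketched here under the assumption that each row contains exactly three decimal numbers, is to pull out every number with a regular expression and convert the result to numeric:

```r
Data <- data.frame(values = c("1.50 (1.30 to 1.70)", "1.30 (1.20 to 1.50)"),
                   stringsAsFactors = FALSE)

# Extract every decimal number from each string (a list with one vector per row)
nums <- regmatches(Data$values, gregexpr("[0-9]+\\.[0-9]+", Data$values))

# Stack the per-row vectors into a numeric data frame
# (assumes exactly 3 numbers per row; rbind would recycle otherwise)
out <- as.data.frame(do.call(rbind, lapply(nums, as.numeric)))
names(out) <- paste0("value", 1:3)
print(out)
```

Unlike the character columns returned by separate() and extract(), the columns here are numeric, so trailing zeros are not printed.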

R: save txt with tab as header of rows

I have a very large data frame with SNPs in rows (~50,000) and IDs in columns (~500). An extract would look something like this:
R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
Now I want to save this as a txt file, which is normally no problem with
write.table(example, "example.txt", col.names=T, row.names=T, quote=F)
BUT I need to have a tab (\t) as the first entry of the header row, so in the txt file the data frame should look something like:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
(\t for the tab)
Can anyone help me how to do this?
Btw I also tried:
write.table(data.frame("\t"=rownames(example),example),"example.txt", row.names=FALSE)
It did not work, unfortunately...
Thanks!
This kind of works, just replace stdout() with the path to your output-file:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c(' ', names(data)), collapse = '\t'),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = '\t')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 35 97 27
#> B 12 69 24
#> C 25 9 34
Or with spaces as separators and the tab you wished for in the first column:
data <- data.frame(x = sample(1:100,3),
y = sample(1:100,3),
z = sample(1:100,3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c('\t', names(data)), collapse = ' '),
sapply(seq_len(nrow(data)),
function(i){
paste(c(row.names(data)[i], data[i,]),collapse = ' ')
}))
writeLines(lines, con = stdout())
#> x y z
#> A 3 30 11
#> B 62 69 70
#> C 93 55 73
Using a data frame like the following, where I've changed one row name to illustrate how to deal with cases of unequal length:
df <- read.table(text = "R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58")
You could do something like this:
df <- format(as.matrix(df))
df <- cbind("\\t" = rownames(df), df)
df <- rbind(colnames(df), df)
df[,1] <- stringr::str_pad(df[,1], max(nchar(df[,1])), "right")
write.table(df,
file = "example.txt",
sep = " ",
quote = F,
row.names = F,
col.names = F)
Output:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58
I first converted the numeric values to character and formatted them to make sure they have the same number of digits, otherwise they won't line up. Then I turn the row names into a new variable named \\t, and then I turn the column names into a new row. I use stringr::str_pad() to account for row names of differing lengths. Finally, I write the data frame to TXT file without the row or column names.
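If a genuinely tab-separated file is acceptable, a shorter route may be write.table()'s documented col.names = NA option, which writes an empty header field above the row names so that the header line starts with a tab. A minimal sketch, using the example values from the question and a temporary file as a stand-in for the real output path:

```r
example <- data.frame(R015 = c(0.81, 0.91),
                      R016 = c(0.90, 0.80),
                      R007 = c(0.87, 0.58),
                      row.names = c("cg158", "cg178"))

# Temporary file used for illustration; replace with your own path
out_file <- tempfile(fileext = ".txt")

# col.names = NA adds a blank header entry for the row-name column
write.table(example, out_file, sep = "\t", quote = FALSE, col.names = NA)

# Read the raw lines back to inspect what was written
written <- readLines(out_file)
print(written)
```

Note that the numbers are written without trailing zeros (0.90 becomes 0.9), and the columns are separated by tabs rather than padded with spaces.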

strsplit() output as a dataframe in r

I have some results from a model in Python which I have saved as a .txt file to render in RMarkdown.
The .txt file looks like this:
precision recall f1-score support
0 0.71 0.83 0.77 1078
1 0.76 0.61 0.67 931
avg / total 0.73 0.73 0.72 2009
I read the file into R as
x <- read.table(file = 'report.txt', fill = T, sep = '\n')
When I run this, R stores the results as one column (V1) instead of 5 columns, as shown below:
V1
1 precision recall f1-score support
2 0 0.71 0.83 0.77 1078
3 1 0.76 0.61 0.67 931
4 avg / total 0.73 0.73 0.72 2009
I tried using strsplit() to split the columns, but it doesn't work.
strsplit(as.character(x$V1), split = "|", fixed = T)
Maybe strsplit() is not the right approach? How do I get around this so that I have a [4x5] data frame?
Thanks a lot.
Not very elegant, but this works. First we read the raw text, then we use regex to clean up, delete white space, and convert to csv readable format. Then we read the csv.
library(stringr)
library(magrittr)
library(purrr)
text <- str_replace_all(readLines("~/Desktop/test.txt"), "\\s(?=/)|(?<=/)\\s", "") %>%
.[which(nchar(.)>0)] %>%
str_split(pattern = "\\s+") %>%
map(., ~paste(.x, collapse = ",")) %>%
unlist
read.csv(textConnection(text))
#> precision recall f1.score support
#> 0 0.71 0.83 0.77 1078
#> 1 0.76 0.61 0.67 931
#> avg/total 0.73 0.73 0.72 2009
Created on 2018-09-20 by the reprex package (v0.2.0).
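A base R variant of the same idea, as a rough sketch: once the spaces inside "avg / total" are removed, read.table() can parse the report directly, because a header with one fewer field than the data rows makes it treat the first column as row names. The exact spacing of the sample lines below is an assumption (the whitespace in the question was collapsed); with a real file you would start from readLines("report.txt").

```r
# Stand-in for readLines("report.txt"); column spacing is assumed
report <- c("             precision    recall  f1-score   support",
            "",
            "          0       0.71      0.83      0.77      1078",
            "          1       0.76      0.61      0.67       931",
            "",
            "avg / total       0.73      0.73      0.72      2009")

# Make the summary label a single whitespace-free token
report <- gsub("avg / total", "avg/total", report, fixed = TRUE)

# Header has 4 fields, data rows have 5, so column 1 becomes the row names;
# blank lines are skipped by default
x <- read.table(text = report, header = TRUE)
print(x)
```

If the labels are needed as a regular column, something like cbind(class = rownames(x), x) gives the five-column layout asked for.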
Since it is much simpler to have Python output a CSV, I am posting an alternative here in case it is useful; even in Python it needs some work.
import pandas as pd

def report_to_csv(report, title):
    report_data = []
    lines = report.split('\n')
    # loop through the per-class lines (skips the header, blank and summary lines)
    for line in lines[2:-3]:
        row = {}
        # split on any run of whitespace, so the exact column spacing does not matter
        row_data = line.split()
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    # read the final summary line, e.g. "avg / total 0.73 0.73 0.72 2009"
    line_data = lines[-2].split()
    summary_dat = []
    row2 = {}
    row2['class'] = ' '.join(line_data[:-4])  # the label itself contains spaces
    row2['precision'] = float(line_data[-4])
    row2['recall'] = float(line_data[-3])
    row2['f1_score'] = float(line_data[-2])
    row2['support'] = float(line_data[-1])
    summary_dat.append(row2)
    summary_df = pd.DataFrame.from_dict(summary_dat)
    # concatenate both df.
    report_final = pd.concat([df, summary_df], axis=0)
    report_final.to_csv(title + 'cm_report.csv', index=False)
Function inspired from this solution

How to transfer every three lines into columns in R or linux

My input is:
"name_01"
"name_02"
0.000573033 0.001268718 0.45 6.5e-01
"name_03"
"name_04"
0.00343343 0.0012435358 0.33 7.5e-09
Expected output in tsv:
"name_01" "name_02" 0.0005 0.0019 0.45 6.5e-01
"name_03" "name_04" 0.0034 0.0012 0.33 7.5e-09
Can anyone help in R or linux?
Assuming this input:
s <- '"name_01"
"name_02"
0.000573033 0.001268718 0.45 6.5e-01
"name_03"
"name_04"
0.00343343 0.0012435358 0.33 7.5e-09'
1) Read it in using scan with a what list and multi.line = TRUE, producing a list L; then set its names and convert it to a data.frame:
L <- scan(textConnection(s), what = list("", "", 0, 0, 0, 0),
multi.line = TRUE, quiet = TRUE)
names(L) <- paste0("V", seq_along(L))
do.call(data.frame, c(L, stringsAsFactors = FALSE))
giving:
V1 V2 V3 V4 V5 V6
1 name_01 name_02 0.000573033 0.001268718 0.45 6.5e-01
2 name_03 name_04 0.003433430 0.001243536 0.33 7.5e-09
2) This alternative also uses scan, but instead of passing a list to what we reshape the result ourselves into a matrix, convert that to a data.frame and make the last 4 columns numeric. If your input actually comes from a file, replace textConnection(s) with something like "myfile.txt". Note that the 6 in the first line of code refers to the number of columns to create, and the 3:6 in the last line of code refers to the column numbers to be converted to numeric.
d <- as.data.frame(matrix(scan(textConnection(s), what = ""),, 6, byrow = TRUE),
stringsAsFactors = FALSE)
d[3:6] <- lapply(d[3:6], as.numeric)
giving:
> d
V1 V2 V3 V4 V5 V6
1 name_01 name_02 0.000573033 0.001268718 0.45 6.5e-01
2 name_03 name_04 0.003433430 0.001243536 0.33 7.5e-09
3) Here is another approach. We read in the data, pick out the data representing the first resulting column and then the second resulting column and then reread it, setting double quote as the comment character, so that the name rows of the input are omitted.
L <- readLines(textConnection(s))
data.frame(Name1 = L[c(TRUE, FALSE, FALSE)], Name2 = L[c(FALSE, TRUE, FALSE)],
read.table(text = L, comment = '"'))
giving:
Name1 Name2 V1 V2 V3 V4
1 "name_01" "name_02" 0.000573033 0.001268718 0.45 6.5e-01
2 "name_03" "name_04" 0.003433430 0.001243536 0.33 7.5e-09
Updates Added additional solutions, made some minor improvements and added clarification.
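Since the question asks for TSV output, here is a small sketch that skips the data frame entirely: it reshapes the raw lines into a three-column character matrix and joins each row with tabs. It assumes the input always comes in complete groups of three lines, and it keeps the quotes and the original number formatting untouched (no rounding). The temporary output file is a stand-in for a real path.

```r
s <- '"name_01"
"name_02"
0.000573033 0.001268718 0.45 6.5e-01
"name_03"
"name_04"
0.00343343 0.0012435358 0.33 7.5e-09'

con <- textConnection(s)
L <- readLines(con)
close(con)

# One row per record: name 1, name 2, numeric line
m <- matrix(L, ncol = 3, byrow = TRUE)

# Turn the spaces inside the numeric line into tabs, then join the pieces with tabs
tsv <- paste(m[, 1], m[, 2], gsub(" +", "\t", m[, 3]), sep = "\t")

out_file <- tempfile(fileext = ".tsv")
writeLines(tsv, out_file)
```

Because nothing is converted to numeric, values such as 6.5e-01 are written exactly as they appear in the input.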

complex formatting using R

I am relatively new to R and would like to use it to reformat a large amount of data held in one file (30,000+ unique IDs) and rewrite it in a different format in another file.
I have a big problem in solving some sort of R-behaviour. Due to the complexity of the case I share a dropbox folder with the example files included. This is what I am trying to achieve:
I would like to have the file formatted like the "Final Product.dat" which is space delimited (but it has to keep rigorously the same spacing as here, otherwise the things I am running will not run!).
So, on the first line of the file there is an ID called "HUXXX...1" which will go from 1 to +30,000, then the next 5 lines will be always the same and then there are the data.
So, lines 2 to 4 are written in a separate file (SHead), and the whole of line 5 is in "header.txt".
For the first line, I created another file called ID-s100, which contains the "HUXXXX" number plus the ID column. I was hoping to use some sort of "sprintf" in that case (see the code attached in Dropbox).
the data are in the "S100.txt"; note that I have created an ID as first column.
My code kind of works, but within each ID it repeats the same values for the whole ID. For example, for ID = 1, instead of creating one HU0000001 and one profile (as in "Final Product.dat"), it creates HU00001 and the same data 12 times, and then does the same for ID 2.
What I want is exactly what "Final Product.dat" is: ONE Unique HUXXXXXN and for each of that HUXXXXN its correspondent data below (which should be matched by the ID column).
Can someone help me? I am finding it hard to get the code working with the feedback I can find on the Internet, and I guess I might need some insight.
https://www.dropbox.com/sh/5gxpw0dy2brahho/AADirGyvMUjkTDPbO4fJmlLBa?dl=0
Since I understand the need for having the code and the data written here I try to do it as best as I could:
I hope the "Final Product" will look as good as it should be:
The bold part of the first line (the HU number) is what I would like to change for each ID:
*HU00000001 AAAA C 150 SOME DEFAULT
so I have a file called ID-s100.txt which has the ID on Col 1 and the HU numbers on Col 2.
Then these lines will be the same for each ID:
# SITE COUNTRY LAT LON SC FL
LCO ................................................. (Some info go here)
# SCOM SALB SLU1 SLDR SLRO SLNF SLPF SMHB SMPX SMKE
.............................. (some other info goes here)
the above four lines are read in the SHead.txt file
Then the header.txt is the following:
# SLB SLMH SLLL SDUL SSAT SRGF SSKS SBDM SLOC SLCL SLSI SLCF SLNI SLHW SLHB SCEC SADC
and the data that goes below are stored in the "s100.csv" and "ksat and srfg.txt" and have the ID on the first column and the data afterwards. the "ksat..." is just one set of data repeated all over until the end, below is an extract of the data for the first ID (first column..):
1 10 0.02 0.04 0.12 0.42 4 8 1.4 2.02 6.18
1 20 0.02 0.04 0.12 0.42 4 8 1.4 2.02 6.18
1 30 0.02 0.04 0.12 0.42 4 8 1.4 2.02 6.18
1 40 0.02 0.04 0.12 0.42 4 8 1.4 0.5 5.5
2 5 0.02 0.04 0.12 0.42 3 8 1.4 2.4 5.5
2 10 0.02 0.04 0.12 0.42 3 8 1.4 2.4 5.5
2 20 0.02 0.04 0.12 0.42 3 8 1.4 2.4 5.5
...then it goes until +30,000
and here is the code:
library(MASS)
setwd("C:\\....")
s100 = read.csv("C:\\.....\\S100.csv", header=F)
SHead = readLines("SHead.txt")
sID.n= read.table("C:\\.....\\ID-s100.txt",header=F)
sheader= read.table("C:\\.....\\header.txt", header=F)
ksat = read.table("C:\\.....\\Ksat and Srfg.txt", header=F)
m99 <- rep("-99", nrow(s100))
m9df = as.data.frame(m99)
SSAT.n = cbind(s100$V1, s100$V12, s100$V2, m9df, s100$V4, s100$V5, s100$V6, ksat$V1, ksat$V2,
s100$V9, s100$V10, s100$V7, s100$V8, m9df, m9df, s100$V11, m9df, m9df, m9df)
for(id in s100$V1){
SSAT = SSAT.n[s100$V1 == id, 2:19]
shead = as.matrix(sheader, dimnames = NULL)
SSAT[is.na(SSAT)] <- " "
xxm = as.matrix(SSAT, dimnames = NULL)
names(xxm) <- names(shead)
t1 = rbind(shead,xxm)
t2 = as.data.frame(t1, dimnames = NULL)
names(t2) = c(" ", " "," "," ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ")
t3 = format(t2, justify = "right", col.names=F, row.names=F, dimnames=NULL)
sID = sID.n[sID.n$V1 == id, 2]
Line2 = paste(sprintf("%1s%1s %5s", "*", sID, " UPSC SCL 140 SOIL PROFILE\n", "\n", sep=""))
file.n = "TRIAL.dat"
con = file(description = paste(file.n), open="a")
writeLines(Line2, con=con, sep= "\n")
writeLines(SHead, con = con , sep = "\n" )
write.table(t3, file= con, append = TRUE, quote=F, col.names=FALSE, row.names= F)
close(con)
}
Thank you in advance for any help!
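One likely cause of the repetition described above is that the loop runs over s100$V1, which contains each ID once per data row, so every ID's block is written once per row. Looping over unique() IDs writes each block once. The sketch below illustrates only that idea: the data values, the two header lines, and the 6-character column widths are hypothetical placeholders, and the real layout would have to be copied from "Final Product.dat".

```r
# Hypothetical stand-ins for S100.csv and SHead.txt
s100 <- data.frame(V1 = c(1, 1, 2), V2 = c(10, 20, 5), V3 = c(0.02, 0.02, 0.04))
SHead <- c("# SITE COUNTRY", "# SCOM SALB")

out_file <- tempfile(fileext = ".dat")
con <- file(out_file, open = "w")          # open the output file once

for (id in unique(s100$V1)) {              # one pass per distinct ID
  block <- s100[s100$V1 == id, ]
  # ID line: "*HU" followed by the ID zero-padded to 8 digits
  writeLines(sprintf("*HU%08d AAAA C 150 SOME DEFAULT", as.integer(id)), con)
  writeLines(SHead, con)
  # Fixed-width data lines: each value right-aligned in 6 characters (assumed widths)
  writeLines(sprintf("%6.0f%6.2f", block$V2, block$V3), con)
}
close(con)

written <- readLines(out_file)
```

Using sprintf() with explicit widths for the data lines, rather than format() followed by write.table(), keeps the spacing under direct control, which matters when the target program is strict about column positions.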

Resources