Very simple question. I am using an excel sheet that has two rows for the column headings; how can I convert these two row headings into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
this snippet of code below reads the file in two steps
First we read the data, so skip = 2 means skip the first 2 lines
Next we read the data again but only the first two line, this output is then further processed by sapply() where we paste(x, collapse = " ") the strings in the columns of the labs data frame. These are assigned to the names of dat
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
The code, when runs produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
This should work. You only need set stringsAsFactors=FALSE when reading data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
colnames(data) <-paste(colnames(dados),dados[1,]) # Set new names
data <- data[-1,] # Remove first line
data <- data.frame(apply(data,2,as.real)) # Correct the classes (works only if all collums are numbers)
Just load your file with read.table(file, header = FALSE, stringsAsFactors = F) arguments. Then, you can grep to find the position this happens.
df <- data.frame(V1=c(sample(10), "Temp", "degC"),
V2=c(sample(10), "Press", "bar"),
V3 = c(sample(10), "Reagent", "/g"),
V4 = c(sample(10), "Yield_A", "%"),
V5 = c(sample(10), "Conversion", "%"),
stringsAsFactors=F)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))
Related
Im reading a text file containing data with "problematic" line. The last line that starts with *NOTE has to be removed (number of rows in the text file is not always the same):
ColumnA ColumnB ColumnC
A2 17 14
B2 20 -1
C2 21 36
*NOTE: -1 = data do not exist
This is my line to read the text file (i have to select text file since its location not constant:
my_data <- read.delim(file.choose(), header = TRUE, sep = "", quote = "",
dec = ".", fill = TRUE, comment.char = "")
I have tried :
my_data[- grep("*NOTE:", my_data$ColumnA),]
But it does not seem to work.
Any simple solutions to this?
You could call read.delim with comment.char = "*":
my_data <- read.delim(file.choose(), header = TRUE, sep = "", quote = "",
dec = ".", fill = TRUE, comment.char = "*")
This will remove the final line when you are reading it in because it starts with *.
Another option is fread from data.table. fread has a fancy autostart feature which automagically drops lines without the expected number of columns:
library(data.table)
fread(file.choose())
There is another way to handle this, which is to write a short function that takes the regexes that you want to filter out. You can feed it the file name, but if this is missing it will give you the file dialogue:
read_broken <- function(file_path, filter_out = "^[*]NOTE:")
{
if(missing(file_path)) file_path <- file.choose()
x <- suppressWarnings(readLines(file_path))
x <- x[nzchar(x)]
x <- x[!apply(sapply(filter_out, grepl, x), 1, any)]
read.delim(text = x, header = TRUE, sep = "", quote = "", dec = ".", fill = TRUE)
}
So you can do:
read_broken("myfile.txt")
#> ColumnA ColumnB ColumnC
#> 1 A2 17 14
#> 2 B2 20 -1
#> 3 C2 21 36
Or
read_broken("myfile.txt", filter_out = c("^[*]NOTE:", "A2"))
#> ColumnA ColumnB ColumnC
#> 1 B2 20 -1
#> 2 C2 21 36
I have downloaded some GDP data in .xls-format from the OECD website. However, to make this data workable in R, I need to reformat the data to a .csv file. More specifically, I need the year, day and month in the first column, and after the comma I need the GDP values (for example: 1990-01-01, 234590).
The column with GDP values can be easily copied and transposed, but how does one quickly add dates? Is there a fast way to do this, without having to add in the dates manually?
Thanks for the help!
Best,
Sean
PS. Link to (one of) the specific OECD files: https://ufile.io/8ogav or https://stats.oecd.org/index.aspx?queryid=350#
PSS. I have now changed the file to this:
Which I would like to transform into the same style as example 1.
Codes that I use for reading in data:
gdp.start <- c(1970,1) # type "double"
gdp.end <- c(2018,1)
gdp.raw <- "rawData/germany_gdp.csv"
gdp.table <- read.table(gdp.raw, skip = 1, header = F, sep = ',', stringsAsFactors = F)
gdp.ger <- ts(gdp.table[,2], start = gdp.start, frequency = 4) # time-series representation
PSS.
dput(head(gdp.table))
structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
Using your data:
z <- structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
dat <- read.csv2(text=paste(z$V1, collapse='\n'), stringsAsFactors=FALSE, header=FALSE)
dat
# V1 V2
# 1 Q2-1970 1.438.810
# 2 Q3-1970 1.465.684
# 3 Q4-1970 1.478.108
# 4 Q1-1971 1.449.712
# 5 Q2-1971 1.480.136
# 6 Q3-1971 1.505.743
and a simple function to replace quarters with the first date of each quarter
quarters <- function(s, format) {
qs <- c("Q1","Q2","Q3","Q4")
dts <- c("01-01", "04-01", "07-01", "10-01")
for (i in seq_along(qs))
s <- sub(qs[i], dts[i], s)
if (! missing(format))
s <- as.Date(s, format=format)
s
}
We can change them into strings of dates, preserving the order:
str(quarters(dat$V1))
# chr [1:6] "04-01-1970" "07-01-1970" "10-01-1970" "01-01-1971" ...
or we can convert into Date objects by setting the format:
str( quarters(dat$V1, format='%m-%d-%Y') )
# Date[1:6], format: "1970-04-01" "1970-07-01" "1970-10-01" "1971-01-01" ...
so replacing the column with the actual Date object is simply dat$V1 <- quarters(dat$V1, format='%m-%d-%Y').
There is the basic width : xxxx.xxxxxx (4digits before "." 6 digits after".")
Have to add "0" when each side before and after "." is not enough digits.
Use regexr find "[.]" location with combination of str_pad can
fix the first 4 digits but
don't know how to add value after the specific character with fixed digits.
(cannot find a library can count the location from somewhere specified)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am beginner of coding that sometime can't understand some regex like "["
e.t.c .With some explain of them would be super helpful.
Also i have a combination like this :
df$Category<-ifelse(regexpr("[.]",df$Category)==4,
paste("0",df1$Category,sep = ""),df$Category)
df$Category<-str_pad(df$Category,11,side = c("right"),pad="0")
Desire to know are there is any better way do this , especially count and
return the location from the END until specific character appear.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formating doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
Input df seems to be character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use #akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to breakdown the problem into its component parts. One transparent and easy-to-follow, in my opinion, approach would be to utilize the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
Use separate in tidyr to separate Category on decimal. Use str_pad from stringr to add zeros in the front or back and paste them together.
library(tidyr) # to separate columns on decimal
library(dplyr) # to mutate and pipes
library(stringr) # to strpad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)
This question already has answers here:
read/write data in libsvm format
(7 answers)
Closed 8 years ago.
It turns out the format I wanted is called "SVM-Light" and is described here http://svmlight.joachims.org/.
I have a data frame that I would like to convert to a text file with format as follows:
output featureIndex:featureValue ... featureIndex:featureValue
So for example:
t = structure(list(feature1 = c(3.28, 6.88), feature2 = c(0.61, 1.83
), output = c("1", "-1")), .Names = c("feature1", "feature2",
"output"), row.names = c(NA, -2L), class = "data.frame")
t
# feature1 feature2 output
# 1 3.28 0.61 1
# 2 6.88 1.83 -1
would become:
1 feature1:3.28 feature2:0.61
-1 feature1:6.88 feature2:1.83
My code so far:
nvars = 2
l = array("row", nrow(t))
for(i in(1:nrow(t)))
{
l = t$output[i]
for(n in (1:nvars))
{
thisFeatureString = paste(names(t)[n], t[[names(t)[n]]][i], sep=":")
l[i] = paste(l[i], thisFeatureString)
}
}
but I am not sure how to complete and write the results to a text file.
Also the code is probably not efficient.
Is there a library function that does this? as this kind of output format seems common for Vowpal Wabbit for example.
I couln't find a ready-made solution, although the svm-light data format seems to be widely used.
Here is a working solution (at least in my case):
############### CONVERT DATA TO SVM-LIGHT FORMAT ##################################
# data_frame MUST have a column 'target'
# target values are assumed to be -1 or 1
# all other columns are treated as features
###################################################################################
ConvertDataFrameTo_SVM_LIGHT_Format <- function(data_frame)
{
l = array("row", nrow(data_frame)) # l for "lines"
for(i in(1:nrow(data_frame)))
{
# we start each line with the target value
l[i] = data_frame$target[i]
# then append to the line each feature index (which is n) and its
# feature value (data_frame[[names(data_frame)[n]]][i])
for(n in (1:nvars))
{
thisFeatureString = paste(n, data_frame[[names(data_frame)[n]]][i], sep=":")
l[i] = paste(l[i], thisFeatureString)
}
}
return (l)
}
###################################################################################
If you don't mind not having the column names in the output, I think you could use a simple apply to do that:
apply(t, 1, function(x) paste(x, collapse=" "))
#[1] "3.28 0.61 1" "6.88 1.83 -1"
And to adjust the order of appearance in the output to your function's output you could do:
apply(t[c(3, 1, 2)], 1, function(x) paste(x, collapse=" "))
#[1] "1 3.28 0.61" "-1 6.88 1.83"
I have the following data frame
Loci p-value chromosome start end geneDescription
A 2.046584849E-2 1 98542 98699 tyrosine kinase
B 5.67849483E-20 2 8958437 8958437 endocytosis
...
However, when I want to print the data frame with the following code:
write.table(table,"~/Desktop/genes.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, append = FALSE)
I get the following:
Loci p-value chromosome start end geneDescription
A 2.046584849E-20 1 98542 98699 tyrosine kinase
B 5.67849483E-20 2 8958437 8958437 endocytosis
I know that it has to do with the "\t", but can R adjust automatically the width of the columns when printing to get the original data frame above?
Thank you.
No, since this is a tab formatting issue and can be partially solved by increasing the tabwidth on you editor. Try normalizing the length of the column names.
max.name <- max(sapply(colnames(table), nchar))
colnames(table) <- sapply(colnames(table), function(name) paste0(c(name, rep(" ", max.name - nchar(name))), collapse = ''))
Perhaps you're just looking for capture.output or sink.
In the following examples, replace x with an actual file name. This is just done for illustrative purposes.
x <- tempfile()
capture.output(mydf, file=x)
readLines(x)
# [1] " Loci p.value chromosome start end geneDescription"
# [2] "1 A 2.046585e-02 1 98542 98699 tyrosinekinase"
# [3] "2 B 5.678495e-20 2 8958437 8958437 endocytosis"
x <- tempfile()
sink(file = x)
mydf
sink()
readLines(x)
# [1] " Loci p.value chromosome start end geneDescription"
# [2] "1 A 2.046585e-02 1 98542 98699 tyrosinekinase"
# [3] "2 B 5.678495e-20 2 8958437 8958437 endocytosis"
The readLines step is just to show you what was written to your "file".