I have downloaded some GDP data in .xls-format from the OECD website. However, to make this data workable in R, I need to reformat the data to a .csv file. More specifically, I need the year, day and month in the first column, and after the comma I need the GDP values (for example: 1990-01-01, 234590).
The column with GDP values can be easily copied and transposed, but how does one quickly add dates? Is there a fast way to do this, without having to add in the dates manually?
Thanks for the help!
Best,
Sean
PS. Link to (one of) the specific OECD files: https://ufile.io/8ogav or https://stats.oecd.org/index.aspx?queryid=350#
PPS. I have now changed the file to this:
Which I would like to transform into the same style as example 1.
The code I use to read in the data:
gdp.start <- c(1970,1) # type "double"
gdp.end <- c(2018,1)
gdp.raw <- "rawData/germany_gdp.csv"
gdp.table <- read.table(gdp.raw, skip = 1, header = F, sep = ',', stringsAsFactors = F)
gdp.ger <- ts(gdp.table[,2], start = gdp.start, frequency = 4) # time-series representation
PPPS. The first rows of gdp.table:
dput(head(gdp.table))
structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
Using your data:
z <- structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
dat <- read.csv2(text=paste(z$V1, collapse='\n'), stringsAsFactors=FALSE, header=FALSE)
dat
# V1 V2
# 1 Q2-1970 1.438.810
# 2 Q3-1970 1.465.684
# 3 Q4-1970 1.478.108
# 4 Q1-1971 1.449.712
# 5 Q2-1971 1.480.136
# 6 Q3-1971 1.505.743
and a simple function to replace each quarter with the first date of that quarter:
quarters <- function(s, format) {
  qs <- c("Q1","Q2","Q3","Q4")
  dts <- c("01-01", "04-01", "07-01", "10-01")
  # swap each quarter label for the month-day of its first day
  for (i in seq_along(qs))
    s <- sub(qs[i], dts[i], s)
  # optionally convert the resulting strings to Date objects
  if (! missing(format))
    s <- as.Date(s, format=format)
  s
}
We can change them into strings of dates, preserving the order:
str(quarters(dat$V1))
# chr [1:6] "04-01-1970" "07-01-1970" "10-01-1970" "01-01-1971" ...
or we can convert into Date objects by setting the format:
str( quarters(dat$V1, format='%m-%d-%Y') )
# Date[1:6], format: "1970-04-01" "1970-07-01" "1970-10-01" "1971-01-01" ...
so replacing the column with the actual Date object is simply dat$V1 <- quarters(dat$V1, format='%m-%d-%Y').
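To get from here to the .csv the question asks for, the GDP column still needs to become numeric. The following is a minimal sketch, assuming the dots in values such as 1.438.810 are thousands separators (which is not confirmed in the question) and using a hypothetical output file name. It rebuilds the six sample rows so that it runs on its own, and maps the quarters with a small lookup rather than the quarters() function above.

```r
# Rebuild the six sample rows from the question (already split on ';').
dat <- data.frame(
  V1 = c("Q2-1970", "Q3-1970", "Q4-1970", "Q1-1971", "Q2-1971", "Q3-1971"),
  V2 = c("1.438.810 ", "1.465.684 ", "1.478.108 ",
         "1.449.712 ", "1.480.136 ", "1.505.743 "),
  stringsAsFactors = FALSE
)

# Map "Qn-YYYY" to the first day of quarter n.
q     <- as.integer(sub("^Q([1-4])-.*$", "\\1", dat$V1))
year  <- sub("^Q[1-4]-", "", dat$V1)
month <- c(1L, 4L, 7L, 10L)[q]
dat$date <- as.Date(sprintf("%s-%02d-01", year, month))

# Assumption: the dots are thousands separators, so drop them (and any spaces).
dat$gdp <- as.numeric(gsub("[. ]", "", dat$V2))

# Keep only the two columns wanted in the output file.
out <- dat[, c("date", "gdp")]
csv_path <- file.path(tempdir(), "germany_gdp_clean.csv")  # hypothetical file name
write.csv(out, csv_path, row.names = FALSE, quote = FALSE)
readLines(csv_path, n = 2)
```

If the dots turn out to be decimal marks instead, the gsub() step would need to change accordingly.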
I have a data.frame like this
z <- structure(list(ID = c("R-HSA-977606", "R-HSA-977443", "R-HSA-166658",
"R-HSA-166663", "R-HSA-1236394", "R-HSA-390522", "R-HSA-3232118",
"R-HSA-1630316", "R-HSA-112315", "R-HSA-112314"), GeneRatio = c("6/189",
"6/189", "6/189", "4/189", "5/189", "4/189", "3/189", "7/189",
"11/189", "9/189")), row.names = c("R-HSA-977606", "R-HSA-977443",
"R-HSA-166658", "R-HSA-166663", "R-HSA-1236394", "R-HSA-390522",
"R-HSA-3232118", "R-HSA-1630316", "R-HSA-112315", "R-HSA-112314"
), class = "data.frame")
Is it possible to add a 3rd column with the ratio from the 2nd column calculated? i.e. 6/189=0.0317. So in the third column I should have 0.0317.
As it is a string expression, we can use eval/parse
z$newColumn <- sapply(z$GeneRatio, function(x) eval(parse(text = x)))
Output:
> z
ID GeneRatio newColumn
R-HSA-977606 R-HSA-977606 6/189 0.03174603
R-HSA-977443 R-HSA-977443 6/189 0.03174603
R-HSA-166658 R-HSA-166658 6/189 0.03174603
R-HSA-166663 R-HSA-166663 4/189 0.02116402
R-HSA-1236394 R-HSA-1236394 5/189 0.02645503
R-HSA-390522 R-HSA-390522 4/189 0.02116402
R-HSA-3232118 R-HSA-3232118 3/189 0.01587302
R-HSA-1630316 R-HSA-1630316 7/189 0.03703704
R-HSA-112315 R-HSA-112315 11/189 0.05820106
R-HSA-112314 R-HSA-112314 9/189 0.04761905
Or a faster option would be to split by / (or use read.table to create two columns) and then divide, assuming the expression includes only division:
z$newColumn <- Reduce(`/`, read.table(text = z$GeneRatio,
header = FALSE, sep = "/"))
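The "split by /" variant mentioned above is not shown, so here is a minimal sketch of it using strsplit(). It uses three GeneRatio values from the question and assumes every string has exactly one slash.

```r
ratio <- c("6/189", "4/189", "11/189")  # a few GeneRatio values from the question

# Split each string at the literal "/" into numerator and denominator.
parts <- strsplit(ratio, "/", fixed = TRUE)
num <- as.numeric(vapply(parts, `[`, character(1), 1))
den <- as.numeric(vapply(parts, `[`, character(1), 2))

result <- num / den
result
```

Unlike eval(parse()), this never evaluates the strings as code, which matters if the column could contain anything other than simple fractions.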
This code could be refined but it will work with the eval function
# 1- Creating empty column
z$GeneRatioNum <- NA
# 2- Filling it with eval function
for (i in seq_len(nrow(z))) {
  z$GeneRatioNum[i] <- eval(parse(text = z$GeneRatio[i]))
}
I have a large dataset of gene expression data and I'm trying to convert the gene identifiers into gene names using biomaRt in RStudio, but for some reason when I use the merge function on my data frames, my entire data table is merged wrong/erased. I've looked at the previous questions here, but no matter what I try, my code doesn't seem to work properly. Thank you infinitely!
library(biomaRt)
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "genes"
head(resdata)
## Write results
resdata <- resdata[complete.cases(resdata), ]
dim(resdata)
The problems start here:
#to convert gene accession number to gene name
charg <- resdata$genes
head(charg)
charg2 = sapply(strsplit(charg, '.', fixed=T), function(x) x[1])
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
theBM = getBM(attributes='hgnc_symbol',
filters = 'ensembl_gene_id',
values = charg2,
mart = ensembl)
resdata <- merge.data.frame(resdata, theBM, by.x="genes",by.y="hgnc_symbol")
# a <- c(resdata[3])
# counts_resdata <-counts[resdata$ensembl_gene_id,]
# row.names(counts_resdata) <- resdata[,"V1"]
# cal_z_score <- function(x){
# (x - mean(x)) / sd(x)
# }
write.csv(resdata, file="diffexprresultsHEK.csv")
dev.off()
> dput(head(resdata))
structure(list(genes = structure(c("ENSG00000261150.2", "ENSG00000164877.18",
"ENSG00000120334.15", "ENSG00000100906.10", "ENSG00000182759.3",
"ENSG00000124145.6"), class = "AsIs"), baseMean = c(4093.85581350533,
2362.58393155573, 3727.90538524843, 6269.83601940967, 1514.2066991352,
4802.56186913745), log2FoldChange = c(-7.91660950515258, -5.26346217291626,
3.32325541003148, 2.95482654632078, -5.67082078657074, 2.79396304109662
), lfcSE = c(0.192088463317979, 0.149333035266368, 0.105355230912976,
0.097569264524605, 0.194208068005162, 0.0965853229316347), stat = c(-41.2133522670104,
-35.2464688307429, 31.5433356391815, 30.2843990955331, -29.1997178326289,
28.9274079776516), pvalue = c(0, 3.88608699685236e-272, 2.21307385030673e-218,
1.83983881587879e-201, 1.95527687476496e-187, 5.40010609376884e-184
), padj = c(0, 3.9601169541424e-268, 1.50348860477005e-214, 9.3744387266064e-198,
7.97009959691694e-184, 1.83432603828505e-180), `HEK-FUS1-1.counts` = c(8260.9703617894,
5075.51515177084, 665.085490083024, 1513.61286043731, 3440.18729968435,
1262.3583419615), `HEK-FUS1-2.counts` = c(8046.96326903085, 4134.79795973702,
690.697680591815, 1346.52518701783, 2499.92325557892, 1154.73922910593
), `HEK-H149A-1.counts` = c(34.3284200812733, 113.825813953696,
6450.12945737609, 10806.2252897945, 60.5264248801398, 8302.96076228903
), `HEK-H149A-2.counts` = c(33.1612031197744, 126.196800761364,
7105.70891294277, 11412.980740389, 56.1898163973955, 8490.18914319335
)), row.names = c(NA, 6L), class = "data.frame")
Here's some output (where I'm struggling):
> head(charg)
[1] "ENSG00000261150.2" "ENSG00000164877.18" "ENSG00000120334.15"
[4] "ENSG00000100906.10" "ENSG00000182759.3" "ENSG00000124145.6"
> dim(theBM)
[1] 0 1
> head(theBM)
[1] ensembl_gene_id
<0 rows> (or 0-length row.names)
> dim(resdata)
[1] 20381 11
> resdata <- merge.data.frame(resdata, theBM, by.x="genes",by.y="ensembl_gene_id")
> dim(resdata) #after merge
[1] 0 11 #isn't correct -- just row names! where'd my genes go?
Edit: Problems solved! Turns out I was referencing getBM wrong. Thank you all!
If you just want to overwrite the Ensembl IDs with the HGNC symbols, you can do it in one step:
library(biomaRt)
names(resdata)[1] <- "genes"
head(resdata)
## Write results
resdata <- resdata[complete.cases(resdata), ]
dim(resdata)
charg <- resdata$genes
head(charg)
charg2 = sapply(strsplit(charg, '.', fixed=T), function(x) x[1])
ensembl = useMart(biomart = "ensembl", dataset="hsapiens_gene_ensembl")
resdata[1] = getBM(attributes='hgnc_symbol',
filters = 'ensembl_gene_id',
values = charg2,
mart = ensembl)
resdata
(This keeps Log2FC as column 3, which looks right based on the next steps in your pipeline, but if you want something different let me know and I'll update my answer to suit)
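One caveat with assigning the getBM() result straight into a column: biomaRt does not promise one row per input ID, nor the input order, so a merge on the ID is usually safer. The sketch below shows that pattern without needing a network connection. The lookup table theBM is a made-up stand-in for the getBM() result, and its symbols (SYMBOL_A, SYMBOL_B) are placeholders, not real gene names.

```r
# Versioned Ensembl IDs, copied from the dput() output in the question.
resdata <- data.frame(
  genes = c("ENSG00000261150.2", "ENSG00000164877.18", "ENSG00000120334.15"),
  log2FoldChange = c(-7.91660950515258, -5.26346217291626, 3.32325541003148),
  stringsAsFactors = FALSE
)

# Drop the ".version" suffix so the IDs can be matched against Ensembl gene IDs.
resdata$ensembl_gene_id <- sub("\\.[0-9]+$", "", resdata$genes)

# With a live connection the lookup table would come from biomaRt, e.g.:
#   ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
#   theBM <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#                  filters = "ensembl_gene_id",
#                  values = resdata$ensembl_gene_id, mart = ensembl)
# Here a hypothetical table stands in for it: placeholder symbols,
# deliberately shuffled and with one ID missing.
theBM <- data.frame(
  ensembl_gene_id = c("ENSG00000164877", "ENSG00000261150"),
  hgnc_symbol = c("SYMBOL_B", "SYMBOL_A"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps genes without a symbol (hgnc_symbol becomes NA).
annotated <- merge(resdata, theBM, by = "ensembl_gene_id",
                   all.x = TRUE, sort = FALSE)
annotated
```

Requesting ensembl_gene_id alongside hgnc_symbol is what makes the merge possible; with only hgnc_symbol in attributes there is no shared key to join on.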
I have a .txt file that contains some investment data. I want to convert the data in the file to a data frame with three columns. The data in the .txt file looks like this:
Date:
06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15
Equity :
-237.79, -170.37, 304.32, 54.19, -130.5
Debt :
16318.49, 9543.76, 6421.67, 3590.47, 2386.3
If you are going to use read.table(), then the following may help:
Assuming the dat.txt contains above contents, then
dat <- read.table("dat.txt",fill=T,sep = ",")
df <- as.data.frame(t(dat[seq(2,nrow(dat),by=2),]))
rownames(df) <- seq(nrow(df))
colnames(df) <- trimws(gsub(":","",dat[seq(1,nrow(dat),by=2),1]))
yielding:
> df
Date Equity Debt
1 06-04-15 -237.79 16318.49
2 07-04-15 -170.37 9543.76
3 08-04-15 304.32 6421.67
4 09-04-15 54.19 3590.47
5 10-04-15 -130.5 2386.3
Assuming the text file is named demo.txt, here is one way to do this:
#Read the file line by line
all_vals <- readLines("demo.txt")
#Since the column names and data are in alternate lines
#We first gather column names together and clean them
column_names <- trimws(sub(":", "", all_vals[c(TRUE, FALSE)]))
#we can then paste the data part together and assign column names to it
df <- setNames(data.frame(t(read.table(text = paste0(all_vals[c(FALSE, TRUE)],
collapse = "\n"), sep = ",")), row.names = NULL), column_names)
#Since the values are read as text (factors before R 4.0), we use type.convert to
#convert each column to its proper type.
type.convert(df, as.is = TRUE)
# Date Equity Debt
#1 06-04-15 -237.79 16318.49
#2 07-04-15 -170.37 9543.76
#3 08-04-15 304.32 6421.67
#4 09-04-15 54.19 3590.47
#5 10-04-15 -130.50 2386.30
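Neither approach above turns the Date column into real dates. Here is a minimal, self-contained sketch that does, assuming the dates are day-month-year with a two-digit year. The first field runs 06 to 10 on consecutive rows, which suggests it is the day, but the question does not say so. The file contents are inlined as a character vector so the example runs without demo.txt.

```r
# The lines of the file, as readLines() would return them.
txt <- c("Date:",
         "06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15",
         "Equity :",
         "-237.79, -170.37, 304.32, 54.19, -130.5",
         "Debt :",
         "16318.49, 9543.76, 6421.67, 3590.47, 2386.3")

# Odd lines hold the names, even lines hold the comma-separated values.
col_names <- trimws(sub(":", "", txt[c(TRUE, FALSE)], fixed = TRUE))
values <- lapply(strsplit(txt[c(FALSE, TRUE)], ",", fixed = TRUE), trimws)
df <- as.data.frame(setNames(values, col_names), stringsAsFactors = FALSE)

# Assumption: dates are day-month-year with a two-digit year.
df$Date   <- as.Date(df$Date, format = "%d-%m-%y")
df$Equity <- as.numeric(df$Equity)
df$Debt   <- as.numeric(df$Debt)
str(df)
```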
It turns out the format I wanted is called "SVM-Light" and is described here http://svmlight.joachims.org/.
I have a data frame that I would like to convert to a text file with format as follows:
output featureIndex:featureValue ... featureIndex:featureValue
So for example:
t = structure(list(feature1 = c(3.28, 6.88), feature2 = c(0.61, 1.83
), output = c("1", "-1")), .Names = c("feature1", "feature2",
"output"), row.names = c(NA, -2L), class = "data.frame")
t
# feature1 feature2 output
# 1 3.28 0.61 1
# 2 6.88 1.83 -1
would become:
1 feature1:3.28 feature2:0.61
-1 feature1:6.88 feature2:1.83
My code so far:
nvars = 2
l = array("row", nrow(t))
for(i in(1:nrow(t)))
{
l = t$output[i]
for(n in (1:nvars))
{
thisFeatureString = paste(names(t)[n], t[[names(t)[n]]][i], sep=":")
l[i] = paste(l[i], thisFeatureString)
}
}
but I am not sure how to complete and write the results to a text file.
Also the code is probably not efficient.
Is there a library function that does this? This kind of output format seems common, for example for Vowpal Wabbit.
I couldn't find a ready-made solution, although the svm-light data format seems to be widely used.
Here is a working solution (at least in my case):
############### CONVERT DATA TO SVM-LIGHT FORMAT ##################################
# data_frame MUST have a column 'target'
# target values are assumed to be -1 or 1
# all other columns are treated as features
###################################################################################
ConvertDataFrameTo_SVM_LIGHT_Format <- function(data_frame)
{
  # every column except 'target' is a feature
  feature_names = setdiff(names(data_frame), "target")
  l = array("row", nrow(data_frame)) # l for "lines"
  for(i in seq_len(nrow(data_frame)))
  {
    # we start each line with the target value
    l[i] = data_frame$target[i]
    # then append to the line each feature index (which is n) and its
    # feature value (data_frame[[feature_names[n]]][i])
    for(n in seq_along(feature_names))
    {
      thisFeatureString = paste(n, data_frame[[feature_names[n]]][i], sep=":")
      l[i] = paste(l[i], thisFeatureString)
    }
  }
  return (l)
}
###################################################################################
If you don't mind not having the column names in the output, I think you could use a simple apply to do that:
apply(t, 1, function(x) paste(x, collapse=" "))
#[1] "3.28 0.61 1" "6.88 1.83 -1"
And to adjust the order of appearance in the output to your function's output you could do:
apply(t[c(3, 1, 2)], 1, function(x) paste(x, collapse=" "))
#[1] "1 3.28 0.61" "-1 6.88 1.83"
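The question also asks how to write the result to a text file, which none of the snippets above do. Here is a minimal vectorised sketch that builds the index:value pairs column by column and writes them with writeLines(). The helper name to_svmlight and the output path are hypothetical, and integer feature indices are used in place of the column names shown in the question's example.

```r
# The example data from the question (named dat here to avoid masking t()).
dat <- data.frame(feature1 = c(3.28, 6.88), feature2 = c(0.61, 1.83),
                  output = c("1", "-1"), stringsAsFactors = FALSE)

to_svmlight <- function(df, target = "output") {   # hypothetical helper name
  feats <- setdiff(names(df), target)
  # Build one "index:value" character vector per feature column...
  pairs <- lapply(seq_along(feats),
                  function(j) paste(j, df[[feats[j]]], sep = ":"))
  # ...then paste the target and all pairs together, row by row.
  do.call(paste, c(list(df[[target]]), pairs))
}

lines <- to_svmlight(dat)
out_file <- file.path(tempdir(), "train.svmlight")   # hypothetical path
writeLines(lines, out_file)
lines
```

Because paste() is vectorised over rows, this avoids the nested loops; only the loop over columns remains.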
Very simple question. I am using an excel sheet that has two rows for the column headings; how can I convert these two row headings into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
The snippet below reads the file in two steps.
First we read the data; skip = 2 skips the two header lines.
Next we read the file again, but only the first two lines (nrows = 2). sapply() then applies paste(x, collapse = " ") to each column of the labs data frame, and the results are assigned to the names of dat.
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
When run, the code produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
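As a rough illustration of the skip adjustment just described, the sketch below prepends two made-up preamble lines to the example text, so n = 2. The preamble content is hypothetical, and in practice n has to be known or found by inspecting the file.

```r
txt <- "preamble line 1
preamble line 2
Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
n <- 2  # number of lines before the two header rows (assumed known)

# Data: skip the preamble plus the two header rows.
dat  <- read.table(text = txt, skip = n + 2)
# Labels: skip only the preamble, then read exactly the two header rows.
labs <- read.table(text = txt, skip = n, nrows = 2, stringsAsFactors = FALSE)

names(dat) <- sapply(labs, paste, collapse = " ")
names(dat)
```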
This should work. You only need to set stringsAsFactors=FALSE when reading the data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
colnames(data) <- paste(colnames(data), data[1,]) # Set new names
data <- data[-1,] # Remove first line
data <- data.frame(apply(data,2,as.numeric), check.names = FALSE) # Correct the classes (works only if all columns are numbers)
Just load your file with read.table(file, header = FALSE, stringsAsFactors = F). Then you can use grep to find the rows where the headings occur.
df <- data.frame(V1=c(sample(10), "Temp", "degC"),
V2=c(sample(10), "Press", "bar"),
V3 = c(sample(10), "Reagent", "/g"),
V4 = c(sample(10), "Yield_A", "%"),
V5 = c(sample(10), "Conversion", "%"),
stringsAsFactors=F)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))