write.table constantly adds col.names when col.names=FALSE is specified - r

I tried to write a data frame to a txt file without a header, but column names keep showing up. When I open the file directly from the drive, it has 21 rows and no header, but when I read it back with read.delim(), I see a header with some odd symbols.
Here is the code
write.table(trans_sequence, file="mytxtout.txt", sep=";", col.names =FALSE, row.names = FALSE,
quote = FALSE)
When I retrieve the data using read.delim, it looks like the output below. It should have 21 rows, but the top row was turned into a column name, leaving 20 rows. The first row should look like this:
2745;9;2;HbA1c;LDL-C Tests
But read.delim() turned it into a header instead:
read.delim("mytxtout.txt")
X2745.9.2.HbA1c.LDL.C.Tests
1 10433;9;2;BMI;Blood Pressure
2 13601;0;1;LDL-C Tests
3 13601;6;1;LDL-C Tests
4 36127;2;2;BMI;Blood Pressure
5 36127;5;1;Blood Pressure
6 36127;9;2;BMI;Blood Pressure
7 36127;10;2;BMI;Blood Pressure
8 54881;9;2;HbA1c;LDL-C Tests
9 59650;0;2;BMI;Blood Pressure
10 59650;3;2;BMI;Blood Pressure
11 66741;0;1;LDL-C Tests
12 72772;3;1;LDL-C Tests
13 77618;2;3;BMI;BMI Percentile;Blood Pressure
14 77618;3;2;BMI;BMI Percentile
15 81397;4;1;BMI
16 81397;6;2;BMI;Blood Pressure
17 81397;9;2;BMI;Blood Pressure
18 81397;9;1;BMI
19 83520;6;3;BMI;BMI Percentile;Blood Pressure
20 85178;10;1;LDL-C Tests
Any help will be greatly appreciated
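For what it's worth, a likely explanation: the file on disk is fine (21 rows, no header). read.delim() simply defaults to header = TRUE, so it promotes the first data row to column names and mangles the symbols (";", "-" and spaces become "." in the made-up name). A minimal sketch of reading it back cleanly:
# Read the semicolon-separated file without treating row 1 as a header
dat <- read.delim("mytxtout.txt", sep = ";", header = FALSE)
nrow(dat)  # 21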

Related

How to read a textfile line by line in R and search for a special string?

I've got thousands of text files, each with tens of thousands of lines of varying structure. They look like the following 3 lines:
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#
DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16
A # normally marks a break between a field name and its value. Sometimes there is a deeper level where # changes to ° and = changes to *. Lines in the original data run to about 10,000 characters each. In every line I am searching only for the REFERENZ, which can appear multiple times, e.g. in line 1.
The result of the read function for these 3 lines should be a data.frame like this:
> Daten = data.frame(REFERENZ = c(23,24,25))
> str(Daten)
'data.frame': 3 obs. of 1 variable:
$ REFERENZ: num 23 24 25
Does anybody know a function in R that can search for this?
I use the read_lines() function from the readr package for problems like that.
library(readr)
library(data.table)
t1 <- read_lines('textfile.txt')
table <- fread(paste0(t1, collapse = '\n'), sep = '#')
EDIT:
I misunderstood the question, my bad. I think you are looking for REGEX.
library(readr)
library(stringr)
t1 <- 'DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#
DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16'
t1 <- read_lines(t1)  # the string contains newlines, so readr reads it as literal data
Daten = data.frame(REFERENZ = str_extract(str_extract(t1, 'REFERENZ\\W\\d*'), '[0-9]+'))
str(Daten)
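If every occurrence per line is needed (line 1 contains REFERENZ twice), a hedged extension using str_extract_all(), assuming the delimiter after REFERENZ is always = or *:
# Sketch: capture every REFERENZ per line, then keep the numeric part
refs <- str_extract_all(t1, 'REFERENZ[=*][0-9]+')
lapply(refs, function(x) as.numeric(str_extract(x, '[0-9]+')))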

How to count number of instances above a value within a given range in R?

I have a rather large dataset looking at SNPs across an entire genome. I am trying to generate a heatmap that scales based on how many SNPs have a BF (bayes factor) value over 50 within a sliding window of x base pairs across the genome. For example, there might be 5 SNPs of interest within the first 1,000,000 base pairs, and then 3 in the next 1,000,000, and so on until I reach the end of the genome, which would be used to generate a single row heatmap. Currently, my data are set out like so:
SNP BF BP
0001_107388 11.62814713 107388
0001_193069 2.333472447 193069
0001_278038 51.34452334 278038
0001_328786 5.321968927 328786
0001_523879 50.03245434 523879
0001_804477 -0.51777189 804477
0001_990357 6.235452787 990357
0001_1033297 3.08206707 1033297
0001_1167609 -2.427835577 1167609
0001_1222410 52.96447989 1222410
0001_1490205 10.98099565 1490205
0001_1689133 3.75363951 1689133
0001_1746080 3.519987207 1746080
0001_1746450 -2.86666016 1746450
0001_1777011 0.166999413 1777011
0001_2114817 3.266942137 2114817
0001_2232084 50.43561123 2232084
0001_2332903 -0.15022324 2332903
0001_2347062 -1.209000033 2347062
0001_2426273 1.230915683 2426273
where SNP = the SNP ID, BF = the bayes factor, and BP = the position on the genome (I've fudged a couple of > 50 values in there for the data to be suitable for this example).
The issue is that I don't have a SNP for each genome position, otherwise I could simply split the windows of interest based on line count and then count however many lines in the BF column are over 50. Is there any way I can count the number of SNPs of interest within different windows of genome positions? Preferably in R, but I have no issue with other languages like Python or Bash if they get the job done.
Thanks!
library(slider); library(dplyr)
# slide_index_int() returns an integer vector; plain slide_index() returns a list
my_data %>%
  mutate(count = slide_index_int(BF, BP, ~sum(.x > 50), .before = 999999))
This counts how many BF values exceed 50 in the window covering the previous 1,000,000 base pairs of BP, up to and including the current row.
SNP BF BP count
1 0001_107388 11.6281471 107388 0
2 0001_193069 2.3334724 193069 0
3 0001_278038 51.3445233 278038 1
4 0001_328786 5.3219689 328786 1
5 0001_523879 50.0324543 523879 2
6 0001_804477 -0.5177719 804477 2
7 0001_990357 6.2354528 990357 2
8 0001_1033297 3.0820671 1033297 2
9 0001_1167609 -2.4278356 1167609 2
10 0001_1222410 52.9644799 1222410 3
11 0001_1490205 10.9809957 1490205 2
12 0001_1689133 3.7536395 1689133 1
13 0001_1746080 3.5199872 1746080 1
14 0001_1746450 -2.8666602 1746450 1
15 0001_1777011 0.1669994 1777011 1
16 0001_2114817 3.2669421 2114817 1
17 0001_2232084 50.4356112 2232084 1
18 0001_2332903 -0.1502232 2332903 1
19 0001_2347062 -1.2090000 2347062 1
20 0001_2426273 1.2309157 2426273 1
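If fixed, non-overlapping 1 Mb bins are what the heatmap actually needs (the question describes "the first 1,000,000 base pairs, and then the next"), a sketch using dplyr alone, assuming the same my_data:
# Sketch: count SNPs with BF > 50 per fixed 1 Mb bin along the genome
my_data %>%
  group_by(bin = floor(BP / 1e6)) %>%
  summarise(count = sum(BF > 50))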

Retrieve Stata variable notes in R

I have imported a Stata dta file into R using readstata13 package.
The variables have notes which contain the full text of the questions. I found the attr() function, with which I can do a few things, such as extract variable names (attr(df, "names")), variable labels (attr(df, "var")), and label values (attr(df, "label")). However, I have not found a way to extract variable notes.
Is there a way to do so?
Below are a few lines of Stata code that produce a dta file with two variables and variable notes, which can be imported into R.
clear
input int(mpg weight)
34 1800
18 3670
21 4060
15 3720
19 3400
41 2040
25 1990
28 3260
30 1980
12 4720
end
note mpg: Mileage (mpg)
note weight: Weight (lbs.)
save "~/mpg_weight.dta", replace
EDIT:
You can actually do this directly in newer versions of readstata13 as follows:
df = read.dta13("~/mpg_weight.dta")
notes = attr(df, "expansion.fields")
This will generate a list providing variable name, characteristic name and the contents of the Stata characteristic field.
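A hedged sketch of pulling just the notes out of that list (as far as I can tell, each element of expansion.fields is a character vector of variable name, characteristic name, and contents; Stata stores notes as characteristics note1, note2, ..., with note0 holding the count):
# Keep only the noteN characteristics (drop note0, which is just the count)
is_note <- vapply(notes, function(x) grepl("^note[1-9][0-9]*$", x[2]), logical(1))
note_df <- do.call(rbind, lapply(notes[is_note], function(x)
  data.frame(variable = x[1], note = x[3], stringsAsFactors = FALSE)))
note_df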
Here's a quick workaround using your toy example:
clear
input int(mpg weight)
34 1800
18 3670
21 4060
15 3720
19 3400
41 2040
25 1990
28 3260
30 1980
12 4720
end
note mpg: this is the first note
note mpg: and this is the second
note mpg: here's a third
note weight: Weight (lbs.)
save "~/mpg_weight.dta", replace
ds
local varlist `r(varlist)'
foreach var of local varlist {
    generate notes_`var' = ""
    forvalues i = 1 / ``var'[note0]' {
        replace notes_`var' = "``var'[note`i']'" in `i'
    }
}
export delimited notes_* using notes_mpg_weight.dta.csv, replace
You can then simply import everything in R as strings and go from there.
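Back in R, a minimal read of that export (file name as used above, everything kept as character):
notes <- read.csv("notes_mpg_weight.dta.csv", colClasses = "character")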

Is there any way to delete rownames in R?

I made a table like this, named a:
variable relative_importance scaled_importance percentage
1 x005 68046.078125 1.000000 0.195396
2 x004 63890.796875 0.938934 0.183464
3 x007 48253.820312 0.709134 0.138562
4 x012 43492.117188 0.639157 0.124889
5 x008 43132.035156 0.633865 0.123855
6 x013 32495.070312 0.477545 0.093310
7 x009 18466.910156 0.271388 0.053028
8 x015 10625.453125 0.156151 0.030511
9 x010 8893.750977 0.130702 0.025539
10 x014 4904.361816 0.072074 0.014083
11 x002 1812.269531 0.026633 0.005204
12 x001 1704.574585 0.025050 0.004895
13 x006 1438.692139 0.021143 0.004131
14 x011 1080.584106 0.015880 0.003103
15 x003 10.152302 0.000149 0.000029
and used this code to order that table:
setorder(a,variable)
and I want to get only the second column:
a[2]
relative_importance
12 380.4296
11 645.4594
15 10.1440
4 8599.7715
2 10749.5752
13 263.7065
5 8434.3760
6 7443.8530
7 3602.8850
10 935.6713
14 256.7183
3 9160.4062
1 12071.1826
9 1173.0701
8 1698.0955
I want to copy "relative_importance" and paste it into Excel.
But I couldn't remove the row names (12, 11, 15, ..., 9, 8).
Is there any way to print only "relative_importance"? (print without rownames or hide rownames)
Thank you :)
You could simply use writeClipboard(as.character(a$relative_importance)) (Windows only) and paste it into Excel.
You could create a csv file, which you can open with Excel.
write.table(a[2], "myfile.csv", sep = ",", row.names = FALSE, col.names = FALSE)
(write.table() is used here because write.csv() ignores a col.names = FALSE argument.)
Note that the file will be created in your current working directory, which you can find by running the following code: getwd().
On a different note, are you trying to get the column into Excel for further analysis? If you are, I encourage you to learn how to do that analysis in R.

How to find the difference between the values of the last two dates in R

DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find, for each EMMI, the difference between the values on its last two dates in the ACT column. For example, for EMMI 12345, the dates 2011/02/19 and 2011/02/12 give 21 - 21 = 0. I want to do that for the entire ACT column, adding the values in a new column diff. Can anybody let me know how to do it?
This is the output i want
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted within each EMMI this will work; if they are not, sort first. I would sort the entire data frame by date (saving the result of order()), run the above, and then, if you need the original order back, apply order() to that saved permutation to "unorder" the data frame, as sketched below.
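A sketch of that sort-then-unorder approach (assuming Date parses with as.Date):
# Sort by date, take grouped differences, then restore the original row order
o <- order(as.Date(DF3$Date))
DF3 <- DF3[o, ]
DF3$difACT <- ave(DF3$ACT, DF3$EMMI, FUN = function(x) c(NA, diff(x)))
DF3 <- DF3[order(o), ]  # order(o) inverts the permutation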
This is based on the plyr package (not tested); diff() returns n - 1 values per group, so it is padded with a leading NA:
library(plyr)
DF3 <- ddply(DF2, .(EMMI), mutate, difACT = c(NA, diff(ACT)))
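For what it's worth, the same idea in dplyr (a sketch, since plyr is retired):
library(dplyr)
DF3 <- DF2 %>%
  group_by(EMMI) %>%
  mutate(difACT = c(NA, diff(ACT))) %>%
  ungroup()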
