Obtain value from string in column headers in R

I have a text file that looks like the following
DateTime height0.1 height0.2
2009-01-01 00:00 1 1
2009-01-02 00:00 2 4
2009-01-03 00:00 10 1
Obviously this is just an example; the actual file contains a lot more data (about 100 columns), and the header values can be decimals. I can read the file into R with the following:
dat <- read.table(file,header = TRUE, sep = "\t")
where file is the path of the table. This creates a data.frame in the workspace called dat. I would now like to generate a variable from this data.frame called 'vars' which is an array made up of the numbers in the column headers (except from DateTime which is the first column).
For example, here I would have vars = 0.1, 0.2.
Basically I want to take the number that is in the string of the header and then store this in a separate variable. I realize that this will be extremely easy for some, but any advice would be great.

If all the numbers you have are at the end of the names (i.e., not like h984mm19), you can remove everything except digits and punctuation using gsub and convert the result to a numeric vector as follows:
# just give all names except the first column
my_var <- as.numeric(gsub("[^0-9[:punct:]]", "", names(dat)[-1]))
# [1] 0.1 0.2
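Put together as a self-contained sketch (constructing a small data frame inline instead of reading it from a file):

```r
# toy data frame standing in for the file read with read.table
dat <- data.frame(DateTime  = c("2009-01-01 00:00", "2009-01-02 00:00", "2009-01-03 00:00"),
                  height0.1 = c(1, 2, 10),
                  height0.2 = c(1, 4, 1))
# strip everything that is not a digit or punctuation from the headers
vars <- as.numeric(gsub("[^0-9[:punct:]]", "", names(dat)[-1]))
vars
# [1] 0.1 0.2
```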

How to replace only the final character of multiple variable names in R?

Below is some background information about my dataset if you want to understand where my question comes from (I actually want to merge datasets, so maybe somebody knows a more efficient way).
The question:
How to replace only the final character of a variable name in R with nothing (for multiple variables)?
I tried using the sub() function and it worked fine, however, some variable names contain the character I want to change multiple times (e.g. str2tt2). I only want to 'remove' or replace the last '2' with blank space.
Example:
Suppose I have a dataset with these variable names, and I only want to remove the last '_2' characters, I tried this:
h_2ello_2 how_2 are_2 you_2
1 1 3 5 7
2 2 4 6 8
names(data) <- sub('_2', '', names(data))
Output:
hello_2 how are you
1 1 3 5 7
2 2 4 6 8
Now, I want my code to remove the last '_2', so that it returns 'h_2ello' instead of hello_2.
Does anyone know how to?
Thank you in advance!
Background information:
I am currently trying to build a dataset from three separate ones. These three different ones are from three different measurement moments, and thus their variable names include a character after each variable name respective to their measurement moment. That is, for measurement moment 2, the variable names are scoreA2, scoreB2, scoreC2 and for measurement moment 3, the variable names are scoreA3, scoreB3 and scoreC3.
Since I want to merge these files together, I want to remove the '2' and '3' in the datasets and then merge them so that it seems like everyone was measured at the same moment.
However, some score names include the character 2 and 3 as well. For example: str2tt2 is the variable name for Stroop card 2 total time measurement moment 2. I only want to remove the last '2', but when using the sub() function I only remove the first one.
We need to use the metacharacter $, which matches the end of the string, on the original dataset's column names:
names(data) <- sub('_2$', '', names(data))
names(data)
#[1] "h_2ello" "how" "are" "you"
In the OP's code, the pattern _2 matches the first instance in h_2ello_2 (sub replaces only the first match) and removes the _2 from h_2. Instead we need to anchor the pattern to the end of the string.
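For the merging background described above (trailing '2' or '3' on names like str2tt2 and scoreA3), the same end-of-string anchor generalizes to a character class; a small sketch with made-up names:

```r
nms <- c("str2tt2", "scoreA2", "scoreB3")
# [23]$ matches a single 2 or 3 only at the end of the string
sub("[23]$", "", nms)
# [1] "str2tt" "scoreA" "scoreB"
```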

dealing with blank/missing data with write.table in R

I have a data frame where some of the rows have blanks entries, e.g. to use a toy example
Sample Gene RS Chromosome
1 A rs1 10
2 B X
3 C rs4 Y
i.e. sample 2 has no rs#. If I attempt to save this data frame in a file using:
write.table(mydata,file="myfile",quote=FALSE,sep='\t')
and then read.table('myfile',header=TRUE,sep='\t'), I get an error stating that line 2 doesn't have 4 elements. If I set quote=TRUE, then a "" entry appears in the table. I'm trying to figure out a way to create a table using write.table with quote=FALSE while retaining a blank placeholder for rows with missing entries such as sample 2.
Is there a simple way to do this? I attempted to use the argument NA="" in write.table() but this didn't change anything.
If my script's resulting data frame has NA values, I always replace them. One approach is to replace NA in the data frame with some other text that tells you the entry was NA, especially if you are saving the result to a CSV, a database, or some other non-R environment.
A simple script to do that (the function must return x, and assigning with df[] <- lapply(...) keeps the data frame structure):
replace_NA <- function(x, replacement = "N/A") {
  x[is.na(x)] <- replacement
  x
}
df[] <- lapply(df, replace_NA, replacement = "N/A")
You are attempting to reinvent the fixed-width file format. Your requested format would have a blank column between every real column. I don't find a write.fwf, although the 'utils' package has read.fwf. The simplest method of getting your requested output would be:
capture.output(mydata, file='test.dat')
# Result in a text file
Sample Gene RS Chromosome
1 1 A rs1 10
2 2 B X
3 3 C rs4 Y
This essentially uses the print method (at the end of the R REPL) for dataframes to do the spacing for you.
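An alternative sketch, assuming the toy data frame above is called mydata: format() pads every column to a common width, so blank entries survive as visible space placeholders without relying on capture.output:

```r
mydata <- data.frame(Sample = 1:3, Gene = c("A", "B", "C"),
                     RS = c("rs1", "", "rs4"), Chromosome = c("10", "X", "Y"))
# format() pads each column's entries to equal width; blanks become runs of spaces
write.table(format(mydata), file = "test.dat",
            quote = FALSE, row.names = FALSE)
```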

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NAs within a given table, aggregated by the ID of the file each row came from. I then want to output that information in a new data frame showing just the ID and the number of NA rows it contains. I've looked at some similar questions, but they all deal with very short datasets, whereas mine is comparatively long (10k+ lines), so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values
for(filePos in 1:length(fileNames)) {
# read in files **fill in <filePath>**
temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
# count the number of rows with missing values,
# ** fill in <fieldName#> with strings of variable names **
missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)],
                                  1, anyNA))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
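For the single-table case the question actually describes (one data frame with an ID column), a base-R sketch; the column names a, b, c are made up:

```r
Data <- data.frame(ID = c(1, 1, 2, 2, 3),
                   a  = c(1, NA, 3, NA, 5),
                   b  = c(NA, 2, NA, 4, 6),
                   c  = c(7, 8, 9, NA, NA))
# NAs per row, excluding the ID column, then summed within each ID
na_per_row <- rowSums(is.na(Data[, names(Data) != "ID"]))
na_count <- aggregate(na_per_row, by = list(ID = Data$ID), FUN = sum)
names(na_count) <- c("ID", "NA_Count")
na_count
#   ID NA_Count
# 1  1        2
# 2  2        3
# 3  3        1
```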

How can I ignore null headers in a .csv file?

I have a csv file like this
http://190.12.101.70/~digicelc/gestion/reportes/import/liquidacion/13958642917519.csv
and my code is
data <- read.csv('1.csv',header = T, sep=";")
So R tells me
more columns than column names
And I don't want to skip the header of the file
thank you!
I don't see the same behavior here. R adds default column names and NA to unavailable data.
> data <- read.csv("test.csv", header = TRUE, sep = ";")
> data
col1 col2 col3 col4 X X.1
1 val1 val2 val3 val4 val5 NA
2 val1 val2 val3 val4 val5 NA
Are you using the latest version?
But the error message tells you exactly what the problem is: you have more columns than column names.
download.file("http://190.12.101.70/~digicelc/gestion/reportes/import/liquidacion/13958642917519.csv", destfile="1.csv")
D1 <- read.csv2("1.csv", skip=1, header=FALSE)
firstlines <- readLines("1.csv", 3)
splitthem <- strsplit(firstlines, ";")
sapply(splitthem, length)
# [1] 28 42 42
So you have 42 data columns (separated by semicolons) but only 28 column names (again, separated by semicolons). How would R know which name you would want to go with which column? ("Computers are good at following instructions, but not at reading your mind." - Donald Knuth)
You need to edit the source file so that each column has a name, or skip the first row and then get the column names from somewhere else.
Edit - the OP comments:
yes, the idea is to take the first names and then standard variables like V1, V2, or whatever. Otherwise, is there a way to skip those?
Ok then I would just use the above with slight modification:
download.file("http://190.12.101.70/~digicelc/gestion/reportes/import/liquidacion/13958642917519.csv", destfile="1.csv")
D <- read.csv2("1.csv", skip=1, header=FALSE)
header <- strsplit(readLines("1.csv", 3), ";")[[1]]
names(D)[1:length(header)] <- header
Now you have the first 28 variables named, and the rest named V29-V42.
You can "skip" the rest of the names in various ways. If you do as suggested in another answer (Dave's), basically
names(D) <- header
... then variables 29-42 will have NA names. Those are not usable names, and you can address these variables only by column number. Or you can do:
names(D)[29:42] <- ""
Now you can't use these names either.
> D[[""]]
NULL
I think it is useful to give them names, as many data frame operations presume names. For example, suppose you have empty names ("" as above) and try to see the first few rows of your data frame:
head(D)
# skipped most of the output, keeping only column 42:
structure(c("-1", "70", ".5", "70", "266", "70"), class = "AsIs")
1 -1
2 70
3 .5
4 70
5 266
6 70
So when using head, you will see your data frame with funny names. Or another example:
D[1:3,29:31]
.1 .2
1 C_COMPONENTE_LIQ_DESDE_CO 243 LIQUIDACION TOPE CO
2 C_COMPONENTE_LIQ_DESDE_CO 243 RESIDUAL CO
3 C_COMPONENTE_LIQ_DESDE_CO 243 RESIDUAL CO
The first component now is named "", the second one ".1", and the third one ".2". Have a look at a quote from data.frame help file below:
The column names should be non-empty, and attempts to use empty names will have
unsupported results. Duplicate column names are allowed, but you need to use check.names
= FALSE for data.frame to generate such a data frame. However, not all operations on
data frames will preserve duplicated column names: for example matrix-like subsetting
will force column names in the result to be unique.
Or suppose you add some columns to the beginning of your data frame; if you have col names then you can still address what was previously 29th column as D$V29, but with D[,29] you will get something else.
Probably there are other examples. In other words, you can have "unnamed" columns in a data frame but I don't think it is a good idea. And technically, all columns in a data frame will always have a name (it can just be "" or NA), so why not have meaningful names? (Even V29 is better than nothing.)

read.table and comments in R

I'd like to add metadata to my spreadsheet as comments, and have R ignore these afterwards.
My data are of the form
v1,v2,v3,
1,5,7,
4,2,1,#possible error,
(with the exception that it is much longer; the first comment actually appears well outside the top 5 rows used by scan to determine the number of columns)
I've been trying:
read.table("data.name",header=TRUE,sep=",",stringsAsFactors=FALSE,comment.char="#")
But read.table (and, for that matter, count.fields) thinks that I have one more field than I actually do. My data frame ends up with a blank column called 'X'. I think this is because my spreadsheet program adds commas to the end of every line (as in the above example).
Using flush=TRUE has no effect, even though (according to the help file) it " [...] allows putting comments after the last field [...]"
Using colClasses=c(rep(NA,3),NULL) has no effect either.
I could just delete the column afterwards, but since it seems that this is a common practice I'd like to learn how to do it properly.
Thanks,
Andrew
From the doc (?read.table):
colClasses character. A vector of classes to be assumed for the columns. Recycled as necessary, or if the character vector is named, unspecified values are taken to be NA.
Possible values are NA (the default, when type.convert is used), "NULL" (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.
Note that it says to use "NULL", not NULL. Indeed, this works as expected:
con <- textConnection("
v1,v2,v3,
1,5,7,
4,2,1,#possible error,
")
read.table(con, header = TRUE, sep = ",",
stringsAsFactors = FALSE, comment.char = "#",
colClasses = c(rep(NA, 3), "NULL"))
# v1 v2 v3
# 1 1 5 7
# 2 4 2 1
Your issues with the comment character and the number of data columns are unrelated to read.table(); they come from your spreadsheet program (I'm using Excel). The default behavior of read.table is to treat # as the beginning of a comment and ignore what follows. The reason you are getting an extra column is the trailing comma at the end of your data lines: it tells read.table that more data should follow. Reading your original example:
> read.table(text="v1, v2, v3,
+ 1,5,7,
+ 4,2,1,#possible error,", sep=",", header=TRUE)
v1 v2 v3 X
1 1 5 7 NA
2 4 2 1 NA
The comment is ignored by default, and a fourth column is created and labeled X. You could easily delete this column after the fact, use the method that @flodel mentions, or remove the trailing comma before reading the file into R. In Excel, the trailing comma is added when you save a file as csv (comma-separated values) because the comment appears in the fourth column and Excel doesn't recognize it as a comment. If you save the file as space-separated, the problem goes away (remove the sep= argument, since space is the default separator):
> read.table(text="v1 v2 v3
+ 1 5 7
+ 4 2 1#possible error", header=TRUE)
v1 v2 v3
1 1 5 7
2 4 2 1
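If you prefer to keep the trailing comma and simply drop the spurious column after reading, a minimal sketch (the extra column is all NA, so it can just be removed):

```r
dat <- read.table(text = "v1,v2,v3,
1,5,7,
4,2,1,#possible error,",
                  sep = ",", header = TRUE, comment.char = "#")
dat$X <- NULL   # drop the all-NA column created by the trailing comma
dat
#   v1 v2 v3
# 1  1  5  7
# 2  4  2  1
```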
