Sapply function in R - r

I have read two .csv files and did some editing.
a1<-read.csv("2013.csv",header=T, na.strings = c("NULL","PrivacySuppressed"))
a2<-a1[,441,drop=F]
a3<-a1[,-441,drop=F]
a4<-cbind(a1,a2)
a4<-a4[, colSums(is.na(a4)) != nrow(a4)]
mode(a4)
[1] "list"
I need a4 to be numeric, so I used sapply:
s<-sapply(a4, as.numeric)
mode(s)
[1] "numeric"
However, the problem is, the column names disappeared.
names(s)
NULL
All of the previous data frames had column names. Sorry, it is impossible to type them here since there are 600 variables (600 different column names). I had names for my columns up to a4. After applying sapply, names(s) returns NULL. When I just print s, I see the column names, but they are not detected as names for the columns. Please help.
Thank you.

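sapply() returns a matrix here, so the column names are still there, but they are reached with colnames(s) rather than names(s). If you want to keep a4 as a data frame with numeric columns, I think the right command is:
a4[] <- lapply(a4, as.numeric)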
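A minimal sketch of the difference, using a small made-up data frame in place of the 600-column file (the names `toy`, `s` and `toy_num` are illustrative only):

```r
# Toy stand-in for a4: every column holds numbers stored as text.
toy <- data.frame(x = c("1", "2"), y = c("3", "4"), stringsAsFactors = FALSE)

# sapply() simplifies the per-column results into a matrix.
s <- sapply(toy, as.numeric)
is.matrix(s)   # the result is a matrix, not a data frame
names(s)       # NULL: a matrix carries dimnames, not names
colnames(s)    # the column names are still available here

# Converting in place keeps the data frame and its names.
toy_num <- toy
toy_num[] <- lapply(toy_num, as.numeric)
names(toy_num)
```

The `toy_num[] <- ...` form assigns into the existing columns, which is why the data frame class and the names survive.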

Related

When we read csv in R, the numeric values come with prefix X [duplicate]

I asked a question about this a few months back, and I thought the answer had solved my problem, but I ran into the problem again and the solution didn't work for me.
I'm importing a CSV:
orders <- read.csv("<file_location>", sep=",", header=T, check.names = FALSE)
Here's the structure of the dataframe:
str(orders)
'data.frame': 3331575 obs. of 2 variables:
$ OrderID : num -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
$ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...
If I run the length command on the first column, OrderID, I get this:
length(orders$OrderID)
[1] 0
If I run the length on OrderDate, it returns correctly:
length(orders$OrderDate)
[1] 3331575
This is a copy/paste of the head of the CSV.
OrderID,OrderDate
-2034590217,2011-10-14
-2034590216,2011-10-14
-2031892773,2011-10-24
-2031892767,2011-10-21
-2021008573,2011-12-08
-2021008572,2011-12-07
-2021008571,2011-12-07
-2021008570,2011-12-07
-2021008569,2011-12-07
Now, if I re-run the read.csv, but take out the check.names option, the first column of the dataframe now has an X. at the start of the name.
orders2 <- read.csv("<file_location>", sep=",", header=T)
str(orders2)
'data.frame': 3331575 obs. of 2 variables:
$ X.OrderID: num -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
$ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...
length(orders2$X.OrderID)
[1] 3331575
This works correctly.
My question is: why does R add an X. to the beginning of the first column name? As you can see from the CSV file, there are no special characters. It should be a simple load. Adding check.names = FALSE, while it imports the name from the CSV, causes the data to not load correctly for me to perform analysis on.
What can I do to fix this?
Side note: I realize this is a minor issue. I'm just more frustrated by the fact that I think I am loading correctly, yet not getting the result I expected. I could rename the column using colnames(orders)[1] <- "OrderID", but I still want to know why it doesn't load correctly.
read.csv() is a wrapper around the more general read.table() function. That latter function has argument check.names which is documented as:
check.names: logical. If ‘TRUE’ then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
‘make.names’) so that they are, and also to ensure that there
are no duplicates.
If your header contains labels that are not syntactically valid then make.names() will replace them with a valid name, based upon the invalid name, removing invalid characters and possibly prepending X:
R> make.names("$Foo")
[1] "X.Foo"
This is documented in ?make.names:
Details:
A syntactically valid name consists of letters, numbers and the
dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as ‘".2way"’ are not valid,
and neither are the reserved words.
The definition of a _letter_ depends on the current locale, but
only ASCII digits are considered to be digits.
The character ‘"X"’ is prepended if necessary. All invalid
characters are translated to ‘"."’. A missing value is translated
to ‘"NA"’. Names which match R keywords have a dot appended to
them. Duplicated values are altered by ‘make.unique’.
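To make the quoted rules concrete, here is a small sketch that runs make.names() on a few inputs. Apart from "$Foo", the inputs are illustrative choices and are not taken from the question's file.

```r
# A few candidate header labels, valid and invalid.
examples <- c("OrderID", "$Foo", "3_in", "Penta C", "if", "")

# make.names() leaves valid names alone and repairs the rest:
# invalid characters become ".", an "X" is prepended when the name
# would not otherwise start with a letter, and reserved words get a
# trailing dot.
fixed <- make.names(examples)
print(data.frame(original = examples, adjusted = fixed))

# With unique = TRUE, duplicates are disambiguated by make.unique().
deduped <- make.names(c("a", "a"), unique = TRUE)
print(deduped)
```

The empty label is the case behind an unnamed first column showing up as "X" after import.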
The behaviour you are seeing is entirely consistent with the documented way read.table() loads in your data. That would suggest that you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on the locale of your system. For example, the CSV file might include a character that your text editor displays without complaint, but if R is not running in the same locale, that character may not be valid there.
I would look at the CSV file and identify any non-ASCII characters in the header line; there are possibly non-visible characters (or escape sequences; \t?) in the header row also. A lot may be going on between reading in the file with the non-valid names and displaying it in the console which might be masking the non-valid characters, so don't take the fact that it doesn't show anything wrong without check.names as indicating that the file is OK.
Posting the output of sessionInfo() would also be useful.
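One way to look for non-visible characters is to inspect the raw bytes at the start of the file. The sketch below assumes, purely for illustration, that the culprit is a UTF-8 byte-order mark; the question's file is not available, so the temporary file here is built by hand to reproduce that situation.

```r
# Build a small CSV whose first three bytes are a UTF-8 byte-order mark.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("OrderID,OrderDate\n-2034590217,2011-10-14\n"), con)
close(con)

# Inspect the first bytes: anything before the "O" of OrderID (0x4f)
# is invisible in most editors but is part of the first column name.
first_bytes <- readBin(tmp, what = "raw", n = 10)
print(first_bytes)

# Declaring the encoding as "UTF-8-BOM" tells R to drop the mark on
# reading, so the header comes through clean.
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(clean))
```

If the first bytes of the real file turn out to be something else, the same readBin() check still shows exactly what precedes the first label.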
I just came across this problem and it was for a simple reason. I had labels that began with a number, and R was adding an X in front of them all. A name that starts with a digit is not a syntactically valid R name, so make.names() prepends a letter to make it valid.
So, "3_in" became "X3_in" etc...
I solved it by switching the label to "in_3" and the issue was resolved.
I hope this helps someone.
When the column names are not syntactically valid, R puts an "X" at the start of the column name during the import. This usually happens when a column name starts with a number or some special character. Setting check.names = FALSE prevents this, so there will be no "X".
However, some functions may not work if the column names start with numbers or other special characters. An example is the rbind.fill function.
So after applying that function (with the "corrected" column names) I use this simple helper to get rid of the "X". Note that it strips a leading "X" from every column name that has one, including names that legitimately start with "X".
destroyX <- function(es) {
  f <- es
  for (col in seq_len(ncol(f))) {  # for each column in the data frame
    if (startsWith(colnames(f)[col], "X")) {  # if the name starts with 'X' ...
      colnames(f)[col] <- substring(colnames(f)[col], 2)  # ... drop it
    }
  }
  # assign the corrected data back to the original variable name (side effect)
  assign(deparse(substitute(es)), f, inherits = TRUE)
}
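A vectorised variant of the same idea, offered as a sketch rather than a replacement: it only removes an "X" that is followed by a digit (the pattern produced for labels such as "3_in") and returns the modified data frame instead of assigning into the caller's environment. The function name `strip_x_prefix` is made up for this example.

```r
# Remove the "X" that make.names() prepends to names starting with a digit.
# Names that merely begin with the letter X (e.g. "Xylem") are left alone.
strip_x_prefix <- function(df) {
  names(df) <- sub("^X(?=[0-9])", "", names(df), perl = TRUE)
  df
}

demo <- data.frame(X3_in = 1, Xylem = 2)
restored <- strip_x_prefix(demo)
print(names(restored))
```

Returning the data frame keeps the function free of side effects, so the call reads `df <- strip_x_prefix(df)`.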
I ran into a similar problem and wanted to share the following lines of code to correct the column names. It is certainly not perfect, since clean programming up front would be better, but it may be helpful to someone as a starting point for a quick and dirty approach. (I would have liked to add this as a comment on Ryan's question/Gavin's answer, but my reputation is not high enough, so I had to post an additional answer, sorry.)
In my case several steps of writing and reading data produced one or more columns named "X", "X.1", ... with the content in the X column and row numbers in the X.1, ... columns. The content of the X column should be used as row names and the other X.1, ... columns should be deleted.
Correct_Colnames <- function(df) {
  delete.columns <- grep("(^X$)|(^X\\.)(\\d+)($)", colnames(df), perl = TRUE)
  if (length(delete.columns) > 0) {
    row.names(df) <- as.character(df[, grep("^X$", colnames(df))])
    # other data types might apply than character, or
    # introduction of a new separate column might be suitable
    df <- df[, -delete.columns, drop = FALSE]  # drop = FALSE keeps a data frame
    colnames(df) <- gsub("^X", "", colnames(df))
    # X might be replaced by different characters, instead of being deleted
  }
  return(df)
}
I solved a similar problem by including row.names=FALSE as an argument in the write.csv function. write.csv was including the row names as an unnamed column in the CSV file and read.csv was naming that column 'X' when it read the CSV file.
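A short round trip shows the effect described above. The data frame and temporary file names are made up for the illustration.

```r
df <- data.frame(a = 1:2, b = c("u", "v"))

with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

# Default: row names are written as an extra column with an empty header.
write.csv(df, with_rn)
# row.names = FALSE: only the real columns are written.
write.csv(df, without_rn, row.names = FALSE)

names_with <- names(read.csv(with_rn))
names_without <- names(read.csv(without_rn))

print(names_with)     # the empty header has been turned into "X"
print(names_without)  # only the original columns
```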

Changing column names or exceptions to strsplit

I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R, such that for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D, which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically named. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I run the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
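The first line of your code is splitting the variable names by . and extracting the first piece. It sounds like you instead want to split by . and extract all the pieces except for the last one:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE),
                               function(x) paste(x[1:(length(x)-1)], collapse=""))))
Be aware that this treats the last piece as a suffix whenever a name contains a dot, so a bare Penta.D (without the .1) is cut down to Penta while Penta.D.1 becomes PentaD.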
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=FALSE,
                           text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
                           sep=",")
loci <- gsub("\\.\\d$", replacement="", unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the string.
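The same pattern can also be used to group columns by locus with an exact comparison instead of a partial match. This is a sketch under assumptions: the column vector below is typed in by hand (with a Penta.C pair added for illustration), `\\d+` is used so that multi-digit suffixes would also be handled, and the helper `cols_for` is hypothetical.

```r
cols <- c("D2S1338", "D2S1338.1", "CSF1PO", "CSF1PO.1",
          "Penta.C", "Penta.C.1", "Penta.D", "Penta.D.1")

# Strip a trailing ".<digits>" only; the dot inside "Penta.D" is untouched.
locus_of <- sub("\\.\\d+$", "", cols)
loci <- unique(locus_of)
print(loci)

# Select the columns of one locus by exact equality on the stripped name,
# so "Penta.D" can never pick up "Penta.C" columns.
cols_for <- function(locus) cols[locus_of == locus]
print(cols_for("Penta.D"))
```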
The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column renaming by passing the check.names=FALSE argument to read.table. If you refer explicitly to columns, you'll need to back-quote their names:
df$`Penta C`
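A small self-contained illustration of both behaviours, reading from an inline string rather than a file (the two-column content is invented for the example):

```r
csv_text <- "Penta C,Penta D\n1,2\n"

# Default: names are passed through make.names(), so spaces become dots.
checked <- read.csv(text = csv_text)
print(names(checked))

# check.names = FALSE keeps the header labels exactly as written.
raw_names <- read.csv(text = csv_text, check.names = FALSE)
print(names(raw_names))

# Non-syntactic names need backticks with `$`, or a string with `[[`.
value_a <- raw_names$`Penta C`
value_b <- raw_names[["Penta D"]]
```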

Extracting out numbers in a list from R

I am reading this from a CSV file, and I need to write a function that produces a final data frame. Given a particular entry, I have
x
[1] {2,4,5,11,12}
139 Levels: {1,2,3,4,5,6,7,12,17} ...
I can change it to character with
x2<-as.character(x)
which gives me
x2
[1] "{2,4,5,11,12}"
How do I extract 2,4,5,11,12 out (5 elements)?
I have tried various ways, like gsub, but to no avail.
Can anyone please help me?
It sounds like you're trying to import a database table that contains arrays. Since R doesn't know about such data structures, it treats them as text.
Try this. I assume the column in question is x. The result will be a list, with each element being the vector of array values (as character strings) for that row in the table; apply as.numeric to each element if you need numbers.
dat <- read.csv("<file>", stringsAsFactors=FALSE)
dat$x <- strsplit(gsub("\\{(.*)\\}", "\\1", dat$x), ",")
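For a single entry like the one in the question, the same steps can be sketched as a small helper that also converts the pieces to numbers. The function name `parse_braced` is made up for this example.

```r
# Turn strings such as "{2,4,5,11,12}" into numeric vectors.
# Returns a list with one numeric vector per input element.
parse_braced <- function(x) {
  inner <- gsub("[{}]", "", as.character(x))   # drop the braces
  pieces <- strsplit(inner, ",", fixed = TRUE) # split on commas
  lapply(pieces, as.numeric)                   # text -> numbers
}

x <- factor("{2,4,5,11,12}")
values <- parse_braced(x)[[1]]
print(values)
print(length(values))
```

Calling as.character() inside the helper means it accepts the factor column directly as well as plain strings.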

Write different datatype values to a file in R

Is it possible to write values of different datatypes to a file in R? Currently, I am using a simple vector as follows:
> vect = c (1,2, "string")
> vect
[1] "1" "2" "string"
> write.table(vect, file="/home/sampleuser/sample.txt", append= FALSE, sep= "|")
However, since vect is now a vector of strings, the file contents all appear in quoted form:
"x"
"1"|"1"
"2"|"2"
"3"|"string"
Is it possible to keep the data types of entries 1 and 2, so that they are treated as numeric values instead of strings? My expected result is:
"x"
"1"|1
"2"|2
"3"|"string"
Also, I am assuming the left-side values "1", "2" and "3" are vector indexes? I did not understand why the first line is "x".
I wonder if simply removing all the quotes from the output file will solve your problem? That's easy: Add quote=FALSE to your write.table() call.
write.table(vect, file="/home/sampleuser/sample.txt",
append=FALSE, sep="|", quote=FALSE)
x
1|1
2|2
3|string
Also, you can get rid of the column and row names if you like. But now your separator character doesn't appear because you have a one-column table.
write.table(vect, file="/home/sampleuser/sample.txt", append=FALSE, sep="|",
quote=FALSE, row.names=FALSE, col.names=FALSE)
1
2
string
For vectors and matrices, R requires everything to have the same data type. By default, R will coerce all of the data in the vector/matrix into the same format. R will coerce more specific types of data into less specific data types. In this case, any of the items stored in your vector can be reasonably represented as type "character", so it will automatically coerce the numeric parts of the vector to fit that data type.
As @Dason said, you're better off using a list if this isn't something you want.
Alternatively, you can use a data.frame, which lets you store different datatypes in different columns (internally, R stores data.frames as lists, so it makes sense that this would be another option).
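As a rough sketch of the list approach: a list keeps each element's own type, so the quoting can be decided per element when the lines are written. The helper `render` and the output layout (index, then "|", then the value) are assumptions made to mirror the expected result in the question, without the "x" header line.

```r
# A list, unlike c(), does not coerce its elements to a common type.
vals <- list(1, 2, "string")

# Quote character values, leave numbers bare.
render <- function(v) {
  if (is.character(v)) paste0('"', v, '"') else as.character(v)
}

fields <- vapply(vals, render, character(1))
out_lines <- paste(seq_along(vals), fields, sep = "|")

out <- tempfile(fileext = ".txt")
writeLines(out_lines, out)
print(readLines(out))
```

With a data.frame the same effect comes for free per column, since write.table() only quotes character and factor columns.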
