I am trying to use the textcat package for n-gram analysis. It has the following function:
textcat(x, p = TC_char_profiles, method = "CT", ..., options = list())
The function specification indicates that
The argument x can be a character vector of texts, or an R object which can be coerced to this using as.character.
I do not know what "an R object which can be coerced to this using as.character" means. In other words, I do not quite understand the correct input format for x according to this description. Suppose I have 100 documents. How do I convert these documents into the format of x?
You really have two questions here.
(1). What does the "R object which can be coerced to this using as.character" mean?
That means that other classes of R object can be passed in place of a plain character vector. An example is a factor, where as.character(x) drops the extra features (the levels machinery) and returns a simple character vector.
as.character(1:2) ## will give a vector c("1", "2")
This extends to other derived classes: it is a standard R idiom to provide a method for common generics like as.character that defines a coercion from a given class to character.
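For the factor case mentioned above, a minimal sketch:

```r
# A factor stores integer codes plus a table of levels;
# as.character() drops that machinery and returns the labels.
f <- factor(c("low", "high", "low"))
as.character(f)  # c("low", "high", "low")
```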
(2). In what format must my data be to input to textcat?
In short, it must be a character vector, or something that can be coerced to one. You are asking about documents, so presumably you have text files. The function readLines produces a character vector from a text file, one element per line. Anything beyond that needs a lot more detail from you about what the analysis is supposed to do: does the text need to be broken into lines? Into words? Should lines/words from different files be kept as separate sets? And so on.
In really simplistic terms, using the example from ?readLines, you could do something like this (a fuller answer would need the extra detail asked for above):
cat("TITLE extra line", "2 3 5 7", "", "11 13 17",
    file = "ex.data", sep = "\n")
readLines("ex.data", n = -1)
x <- readLines("ex.data", n = -1)
require(textcat)
textcat(x)
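To get from 100 documents to a single x with one element per document, one common pattern is to collapse each file into a single string. A sketch, assuming your files sit in a hypothetical docs/ directory:

```r
# Hypothetical setup: all document files are plain text inside "docs/".
files <- list.files("docs", full.names = TRUE)

# One element of x per document: read each file's lines and
# collapse them into a single string.
x <- vapply(files,
            function(f) paste(readLines(f), collapse = "\n"),
            character(1))

# textcat(x) would then return one category guess per document.
```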
I am pretty new to R, about 3 months in. When I was trying to run a regression, R gave me this error: Error: unexpected input in "reg1 <- lm(_". The variable I use has an underscore in its name, as do some other variables. This is the first time I have had a variable with an underscore, and I don't know whether R supports underscores in regressions. If it doesn't, how can I change the name?
As good practice, always begin variable/column names with letters (this is not explicitly the rule, and you can technically start with a period, but it will save hassle). When dealing with data imported into R with predefined column names (or with data frames in general), you can rename columns of a data frame df as follows:
names(df)[names(df) == 'OldName'] <- 'NewName'
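If many columns have awkward names, base R's make.names() can sanitize them all at once: it translates invalid characters to "." and prepends "X" when a name starts with an underscore or a digit. A small sketch:

```r
# Hypothetical data frame with two illegal column names:
dd <- data.frame(`_y` = 1:2, `2nd col` = 3:4, check.names = FALSE)
# make.names() produces syntactically valid names for all columns.
names(dd) <- make.names(names(dd))
names(dd)  # c("X_y", "X2nd.col")
```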
If you really need to, you can protect 'illegal' names with backquotes (although I agree with the other answers/comments that this is not good practice ...):
dd <- data.frame(`_y`=rnorm(10), x = 1:10, check.names=FALSE)
names(dd)
## [1] "_y" "x"
lm(`_y` ~ x, data = dd)
I am new to R, please have mercy. I imported a table from an Access database via odbc:
df <- select(dbReadTable(accdb_path, name ="accdb_table"),"Col_1","Col_2","Col_3")
For
> typeof(df$Col_3)
I get
[1] "list"
Using library(dplyr.teradata), I converted the blob to a string (maybe already on the wrong path here):
df$Hex <- blob_to_string(df$Col_3)
and now end up with a column (typeof = character) full of Hex:
df[1,4]
[1] 49206765742061206c6f74206f662048657820616e642068617665207468652069737375652077697468207370656369616c2063687261637465727320696e204765726d616e206c616e6775616765206c696b65206e2b4150592d7
My question is how to convert each value in Col_3 into proper text (if possible, preserving German special characters like ü, ö, ä and ß).
I am aware of this solution, How to convert a hex string to text in R?, but I can't apply it properly:
df$Text <- rawToChar(as.raw(strtoi(df$Hex, 16L)))
Fehler in rawToChar(as.raw(strtoi(BinData$Hex, 16L))) :
Zeichenkette '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
Thx!
If I understand this correctly, what you want to do is apply a function to each element of a list so that it returns a character vector (which you can then add to a data frame, if you wish).
This can be accomplished easily with the purrr family of functions. The following takes each element of df$Col_3 and runs the function on it (each element being the x in the function below):
purrr::map_chr(.x = df$Col_3,
               .f = function(x) rawToChar(as.raw(strtoi(x, 16L))))
You could achieve much the same with base R functions such as lapply() followed by unlist(), or sapply(), but purrr's type-stable functions make inconsistent results easier to catch.
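One caveat with strtoi() here: it parses a whole string as a single integer, so a long hex string like the one above overflows to NA, which likely produces the nul-character error shown in the question. The string must first be split into two-character byte pairs. A sketch, assuming each value is an even-length hex string and the source text is Latin-1 encoded (common for German Access data; change from = "latin1" if it is really UTF-8):

```r
hex_to_text <- function(hex) {
  # One two-character chunk per byte:
  starts <- seq(1, nchar(hex), by = 2)
  bytes  <- substring(hex, starts, starts + 1)
  txt <- rawToChar(as.raw(strtoi(bytes, base = 16L)))
  # Re-declare the encoding so umlauts (ü, ö, ä, ß) come out right:
  iconv(txt, from = "latin1", to = "UTF-8")
}

hex_to_text("48e4")  # 0x48 is "H", 0xe4 is Latin-1 "ä"
```

Applied to the data frame, this would be df$Text <- purrr::map_chr(df$Col_3, hex_to_text).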
I'm trying to read an Excel file into R.
I used read_excel function of the readxl package with parameter col_types = "text" since the columns of the Excel sheet contain mixed data types.
df <- read_excel("Test.xlsx",sheet="Sheet1",col_types = "text")
But a very slight difference is introduced into some numeric values. It's always the same few values, so I think it's some hidden attribute in Excel.
I tried formatting those values as numbers in Excel, and also tried adding 0s after the number, but it didn't work.
When I changed the numeric value of a cell from 2.3 to 2.4, it was read correctly by R.
This is a consequence of floating-point imprecision, but it's a little tricky. When you enter the number 1.2 (for example) into R or Excel, it's not represented exactly as 1.2:
print(1.2,digits=22)
## [1] 1.199999999999999955591
Excel and R usually try to shield you from these details, which are inevitable with fixed-precision floating-point values (which most computer systems use), by limiting the printed precision to a level that hides the imprecision. When you explicitly convert to character, however, R figures you don't want to lose information, so it gives you all the digits. Numbers that can be represented exactly in binary, such as 2.375, don't gain all those extra digits.
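For contrast, a fraction built from powers of two prints cleanly at any precision:

```r
# 2.375 = 2 + 1/4 + 1/8 has an exact binary representation,
# so no spurious digits appear:
print(2.375, digits = 22)  # [1] 2.375

# 1.2 does not, so the extra digits show up:
print(1.2, digits = 22)    # [1] 1.199999999999999955591
```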
However, there's a simple solution in this case:
readxl::read_excel("Test.xlsx", na="ND")
This tells R that the string "ND" should be treated as a special "not available" value, so all of your numeric values get handled properly. When you examine your data, the tiny imprecisions will still be there, but R will print the numbers the same way that Excel does.
I feel like there's probably a better way to approach this (mixed-type columns are really hard to deal with), but if you need to 'fix' the format of the numbers you can try something like this:
x <- c(format(1.2, digits = 22), "abc")
x
## [1] "1.199999999999999955591" "abc"
fix_nums <- function(x) {
    nn <- suppressWarnings(as.numeric(x))
    x[!is.na(nn)] <- format(nn[!is.na(nn)])
    return(x)
}
fix_nums(x)
## [1] "1.2" "abc"
Then, if you're using the tidyverse, you can use my_data %>% mutate_all(fix_nums).
My numbers have "," for 1,000 and above, and R considers the variables factors. I want to convert two such variables from factor to numeric. (Both variables are numbers, but R treats them as factors for some reason; the data is imported from Excel.) To change a factor variable mydata$x1 to numeric I use the following code, but it does not seem to work properly and some values change: for example, it turned 8180 into zero, and the same happened with many other values. Is there another way to do this without such issues?
mydata$x1<- as.numeric(as.character(mydata$x1))
Since it seems the problem is that you have saved your numeric data as text in Excel (instead of using a number format to display the commas), you may want a function like this:
#' Replace Commas Function
#'
#' Converts a character representation of a number that contains comma
#' separators into a numeric value.
#' @keywords read data
#' @export
replaceCommas <- function(x) {
    as.numeric(gsub(",", "", x, fixed = TRUE))
}
Then
rcffull$RetBackers <- replaceCommas(rcffull$Returning.Backers)
rcffull$NewBackers <- replaceCommas(rcffull$New.Backers)
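A quick check of the helper (redefined here so the snippet stands alone):

```r
# Strip the thousands separators, then parse as numeric:
replaceCommas <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

replaceCommas(c("8,180", "1,234,567", "42"))  # c(8180, 1234567, 42)
```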
The reason G5W is asking for dput output is that he (and we) are unable to figure out how something that displays as 8180 as a factor might fail to be converted properly by that code. It's not because of leading or trailing spaces (which would not show up in the printed version of a factor anyway). Witness this test:
> as.numeric(as.character(factor(" 8180")))
[1] 8180
> as.numeric(as.character(factor(" 8180 ")))
[1] 8180
And the fact that it gets converted to 0 is a real puzzle, since items that are not recognized as parseable R numerics generally get coerced to NA (with a warning):
> as.numeric(as.character(factor(" 0 8180 ")))
[1] NA
Warning message:
NAs introduced by coercion
We really need the dput output for the item that displays as "8180" and its neighbors.
Is it possible to write values of different datatypes to a file in R? Currently, I am using a simple vector as follows:
> vect = c (1,2, "string")
> vect
[1] "1" "2" "string"
> write.table(vect, file="/home/sampleuser/sample.txt", append= FALSE, sep= "|")
However, since vect is now a character vector, opening the file shows the following quoted contents:
"x"
"1"|"1"
"2"|"2"
"3"|"string"
Is it not possible to preserve the data types of entries 1 and 2 so that they are treated as numeric values instead of strings? My expected result is:
"x"
"1"|1
"2"|2
"3"|"string"
Also, I am assuming the left-hand values "1", "2" and "3" are vector indexes? I did not understand why the first line is "x".
I wonder if simply removing all the quotes from the output file will solve your problem? That's easy: add quote=FALSE to your write.table() call. (The "x" on the first line is the automatic column name write.table assigns when given a bare vector, and "1", "2", "3" are row names, not indexes.)
write.table(vect, file = "/home/sampleuser/sample.txt",
            append = FALSE, sep = "|", quote = FALSE)
x
1|1
2|2
3|string
Also, you can get rid of the column and row names if you like. But then your separator character doesn't appear at all, because you have a one-column table:
write.table(vect, file = "/home/sampleuser/sample.txt", append = FALSE,
            sep = "|", quote = FALSE, row.names = FALSE, col.names = FALSE)
1
2
string
For vectors and matrices, R requires every element to have the same data type, so it coerces all of the data in the vector/matrix into a common type, converting more specific data types into less specific ones. In this case, every item stored in your vector can reasonably be represented as type "character", so R automatically coerces the numeric parts to fit that type.
As #Dason said, you're better off using a list if this isn't something you want.
Alternatively, you can use a data.frame, which lets you store different datatypes in different columns (internally, R stores data.frames as lists, so it makes sense that this would be another option).
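A minimal sketch of both alternatives:

```r
# A list keeps each element's own type
# (classes here: numeric, numeric, character):
l <- list(1, 2, "string")
sapply(l, class)

# A data frame keeps types per column; write.table() then quotes
# only the character column:
df <- data.frame(x = c(1, 2), y = c("a", "string"))
write.table(df, sep = "|", row.names = FALSE)
## "x"|"y"
## 1|"a"
## 2|"string"
```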