R concatenate 2 numeric columns with big numbers - r

I want to concatenate two columns of numbers and get a number as the result.
Example:
First column: 123456
Second column: 78910
Desired Result: 12345678910
test<-matrix(
c(328897771052600448,4124523780886268),
nrow=1,
ncol=2
)
test<-data.frame(test)
str(test)
Both columns are numeric
colnames(test)<-c("post_visid_high","post_visid_low")
test_2<-transform(test,visit_id=as.numeric(paste0(post_visid_high,post_visid_low)))
Problem:
My concatenated result is: 3.288977710526004289528e+33
I don't understand why I get this (incorrect?) number.
When I leave out "as.numeric" I get the right result:
test_2<-transform(test,visit_id=paste0(post_visid_high,post_visid_low))
test_2
But it is converted to a factor:
str(test_2)

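These numbers are too large to be stored exactly as numeric: a double has a 53-bit significand, so integers above 2^53 (about 9.0e15) cannot all be represented exactly. You can either store them as strings by specifying stringsAsFactors = FALSE: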
test_2<-transform(test,visit_id=paste0(post_visid_high,post_visid_low), stringsAsFactors = FALSE)
test_2
#> post_visid_high post_visid_low visit_id
#> 1 3.288978e+17 4.124524e+15 3288977710526004484124523780886268
str(test_2)
#> 'data.frame': 1 obs. of 3 variables:
#> $ post_visid_high: num 3.29e+17
#> $ post_visid_low : num 4.12e+15
#> $ visit_id : chr "3288977710526004484124523780886268"
Or you can use a package such as gmp to handle arbitrarily large integers:
library(gmp)
test_3 <- test
test_3$visit_id <- as.bigz(paste0(test_3$post_visid_high, test_3$post_visid_low))
test_3
#> post_visid_high post_visid_low visit_id
#> 1 3.288978e+17 4.124524e+15 3288977710526004484124523780886268
str(test_3)
#> 'data.frame': 1 obs. of 3 variables:
#> $ post_visid_high: num 3.29e+17
#> $ post_visid_low : num 4.12e+15
#> $ visit_id : 'bigz' raw 3288977710526004484124523780886268
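The precision limit behind this answer can be checked directly. The following is a minimal sketch, not part of the original answer. The names high, low and visit_id are illustrative, and the two IDs are assumed to be available as character strings from the start, so no digits are lost before concatenation:

```r
# A double has a 53-bit significand, so 2^53 + 1 cannot be represented
# and rounds back to 2^53.
limit <- 2^53
limit == limit + 1        # TRUE
.Machine$double.digits    # 53

# Keeping the IDs as strings avoids any rounding (illustrative names).
high <- "328897771052600448"
low  <- "4124523780886268"
visit_id <- paste0(high, low)
visit_id                  # "3288977710526004484124523780886268"
nchar(visit_id)           # 34
```

If the IDs arrive from a file, reading those columns as character rather than numeric would keep them intact in the same way.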

Related

colnames() function in R - Treating table values as independent objects/variables

I have a list of values which I would like to use as names for separate tables scraped from separate URLs on a certain website.
> Fac_table
[[1]]
[1] "fulltime_fac_table"
[[2]]
[1] "parttime_fac_table"
[[3]]
[1] "honorary_fac_table"
[[4]]
[1] "retired_fac_table"
I would like to loop through the list to automatically generate 4 tables with the respective names.
The result should look like this:
> fulltime_fac_table
職稱
V1 "教授兼系主任"
V2 "教授"
V3 "教授"
V4 "教授"
V5 "特聘教授"
> parttime_fac_table
職稱 姓名
V1 "教授" "XXX"
V2 "教授" "XXX"
V3 "教授" "XXX"
V4 "教授" "XXX"
V5 "教授" "XXX"
V6 "教授" "XXX"
I have another list, named 'headers', containing column headings of the respective tables online.
> headers
[[1]]
[1] "職稱" "姓名" "    研究領域"
[4] "聯絡方式"
[[2]]
[1] "職稱" "姓名" "研究領域" "聯絡方式"
I was able to assign values to the respective tables with this code:
> assign(eval(parse(text="Fac_table[[i]]")), as_tibble(matrix(fac_data,
+ nrow = length(headers[[i]]))))
This results in a populated table, without column headings, like this one:
> honorary_fac_table
[,1] [,2]
V1 "名譽教授" "XXX"
V2 "名譽教授" "XXX"
V3 "名譽教授" "XXX"
V4 "名譽教授" "XXX"
But I was unable to assign column names to each table.
None of the attempts below worked:
> assign(colnames(eval(parse(text="Fac_table[1]"))), c(gsub("\\s", "", headers[[1]])))
Error in assign(colnames(eval(parse(text = "Fac_table[1]"))), c(gsub("\\s", :
invalid first argument
> colnames(eval(parse(text="Fac_table[i]"))) <- c(gsub("\\s", "", headers[[i]]))
Error in colnames(eval(parse(text = "Fac_table[i]"))) <- c(gsub("\\s", :
target of assignment expands to non-language object
> do.call("<-", colnames(eval(parse(text="Fac_table[i]"))), c(gsub("\\s", "", headers[[i]])))
Error in do.call("<-", colnames(eval(parse(text = "Fac_table[i]"))), c(gsub("\\s", :
second argument must be a list
To simplify the issue, a reproducible example is as follows:
> varNamelist <- list(c("tbl1","tbl2","tbl3","tbl4"))
> colHeaderlist <- list(c("col1","col2","col3","col4"))
> tableData <- matrix(1:12, ncol=4)
This works:
> assign(eval(parse(text="varNamelist[[1]][1]")), matrix(tableData, ncol
+ = length(colHeaderlist[[1]])))
But this doesn't:
> colnames(as.name(varNamelist[[1]][1])) <- colHeaderlist[[1]]
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3", "col4" :
attempt to set 'colnames' on an object with less than two dimensions
It seems like the colnames() function in R is unable to treat the strings as represented by "Fac_table[i]" as variable names, in which independent data (separate from Fac_table) can be stored.
> colnames(as.name(Fac_table[[1]])) <- headers[[1]]
Error in `colnames<-`(`*tmp*`, value = c("a", "b", "c", :
attempt to set 'colnames' on an object with less than two dimensions
Substituting for 'fulltime_fac_table' directly works fine.
> colnames(fulltime_fac_table) <- headers[[1]]
Is there any way around this issue?
Thanks!
There is a solution to this, but I think the current setup may be more complex than necessary, if I understand it correctly. So I'll try to make the task easier.
If you're working with one-dimensional data, I'd recommend using vectors, as they're more appropriate than lists for that purpose. So for this project, I'd begin by storing the names of tables and headers, like this:
varNamelist <- c("tbl1","tbl2","tbl3","tbl4")
colHeaderlist <- c("col1","col2","col3","col4")
From your question it's still difficult to tell what format the input data for these tables is in and where it comes from, but in general a data frame can be easier to work with than a matrix, as long as you're not working with Big Data. The assign function is also typically not necessary for these sorts of steps. Instead, when setting up a data frame, we can set the name of the data frame, the names of the columns, and the data contents all at once, like this:
tbl1 <- data.frame("col1"=c(1,2,3),
"col2"=c(4,5,6),
"col3"=c(7,8,9),
"col4"=c(10,11,12))
Again, we're using vectors, created with c() instead of list(), to fill each column, since each column has only a single dimension.
To check the output of tbl1, we can then use print():
print(tbl1)
col1 col2 col3 col4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
If you can create the tables in roughly this way, that will probably be easier than juggling so many lists and assign() calls, which quickly becomes overly complicated.
But if you want at the end to store all the tables in a single place, you could put them in a list:
tableList <- list(tbl1=tbl1,tbl2=tbl2,tbl3=tbl3,tbl4=tbl4)
str(tableList)
List of 4
$ tbl1:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl2:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl3:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl4:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
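The list-based approach also covers the original goal of naming tables and their columns from character vectors. The following is only a sketch under assumptions: varNames, colHeaders and tableData are stand-ins modelled on the question's reproducible example, and every table reuses the same data purely for illustration.

```r
# Stand-in inputs modelled on the question's reproducible example
varNames   <- c("tbl1", "tbl2", "tbl3", "tbl4")
colHeaders <- c("col1", "col2", "col3", "col4")
tableData  <- matrix(1:12, ncol = 4)

# Build one named list instead of four separately assigned variables.
tables <- setNames(
  lapply(varNames, function(nm) {
    m <- tableData            # in real use: the data scraped for table `nm`
    colnames(m) <- colHeaders # column names are set on a real object
    m
  }),
  varNames
)

# Tables are then looked up by name, with no assign() or eval(parse())
colnames(tables[["tbl2"]])
```

Because colnames<- is applied to an ordinary object inside the function, the "non-language object" problem from the question does not arise.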
I've found a workaround based on @Ryan's recommendation, given by this code:
library(rvest) # assumed source of read_html(), html_nodes(), html_text() and %>%
for (i in seq_along(url)) {
  webpage <- read_html(url[i]) # loop through the URL list to access the HTML data
  fac_data <- html_nodes(webpage, '.tableunder') %>% html_text()
  fac_data1 <- html_nodes(webpage, '.tableunder1') %>% html_text()
  fac_data <- c(fac_data, fac_data1) # store the table data from each URL in a variable
  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow = TRUE) # make a matrix to extract column data
  for (j in seq_along(headers[[i]])) {
    y <- cbind(x[, j]) # extract column data and store it in a temporary variable
    colnames(y) <- as.character(headers[[i]][j]) # add the column name
    print(cbind(y)) # loop through the headers list to print column data in sequence. ** cbind(y) will be overwritten when I try to store the result on a list with 'z <- cbind(y)'.
  }
}
I am now able to print out all values, complete with headers of the data in question.
Follow-up questions have been posted here.
The final code solved this problem as well.

Wilcoxon test in R - x must be numeric error

I have a problem with the wilcox.test in R.
My data object is a matrix in which the first column contains a name, and all other columns contain a (gene expression) measurement, which is numeric:
str(myMatrix)
'data.frame': 2000 obs. of 143 variables:
$ precursor : chr "name1" "name2" "name3" "name4" ...
$ sample1: num 1.46e-03 2.64e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample2: num 1.46e-03 1.91e+02 1.46e-03 1.46e-03 1.46e-03 ...
$ sample3: num 1.46e-03 3.01e+02 1.46e-03 1.46e-03 4.96 ...
For all of the 2000 rows I want to test whether there is a difference between 2 given parts of the matrix. I tried this in 4 different ways:
wilcox.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 1.549484e-16
wilcox.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#Error in wilcox.test.default(myMatrix[i, 2:87], myMatrix[i, 88:98]) :
#'x' must be numeric
t.test(as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]))$p.value
#[1] 0.2973957
t.test(myMatrix[i,2:87],myMatrix[i,88:98])$p.value
#[1] 0.3098505
So as you can see, the Wilcoxon test only runs without an error message if I use as.numeric() on the already numeric values, but then the t.test results differ completely between the two variants even though they should not.
Manually verifying by using an online tool shows that the t.test results using as.numeric() values are wrong.
Any suggestions about how I can solve this problem and do the correct Wilcoxon test? If you need more information let me know.
Actually, myMatrix[i, 2:87] is still a data.frame. See the following example.
> myMat
fir X1 X2 X3 X4
1 name1 1 5 9 13
2 name2 2 6 10 14
3 name3 3 7 11 15
4 name4 4 8 12 16
> class(myMat[1, 2:4])
[1] "data.frame"
> as.numeric(myMat[1, 2:4])
[1] 1 5 9
Converting your data to a real matrix will solve your problem.
> myMat_01 <- myMat[, 2:5]
> rownames(myMat_01) <- myMat$fir
> myMat_01 <- as.matrix(myMat_01)
> class(myMat_01[1, 2:4])
[1] "integer"
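A further detail that may explain the odd results in the question: in as.numeric(myMatrix[i,2:87],myMatrix[i,88:98]) both groups are passed to as.numeric(), which uses only its first argument, so the test is run on a single sample. The sketch below uses made-up toy data (the data frame df and its column layout are assumptions, not the asker's data) and converts each group separately:

```r
# Made-up toy data: one name column plus six numeric sample columns
set.seed(1)
df <- data.frame(precursor = c("name1", "name2"),
                 matrix(rnorm(12), nrow = 2))
i <- 1

# A one-row slice of a data frame is still a data frame;
# unlist() turns it into a plain numeric vector.
x <- unlist(df[i, 2:4])   # first group
y <- unlist(df[i, 5:7])   # second group

# The second argument is ignored here, so this is a one-sample test on x
p_one <- wilcox.test(as.numeric(x, y))$p.value

# Each group as its own numeric vector gives the two-sample test
p_two <- wilcox.test(x, y)$p.value
```

The same pattern applies to t.test(), which would make the two tests comparable again.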

Replace integer(0) by NA

I have a function that I apply to a column, putting the results in another column, and it sometimes gives me integer(0) as output. So my output column will be something like:
45
64
integer(0)
78
How can I detect these integer(0)'s and replace them by NA? Is there something like is.na() that will detect them ?
Edit: Ok I think I have a reproducible example:
df1 <-data.frame(c("267119002","257051033",NA,"267098003","267099020","267047006"))
names(df1)[1]<-"ID"
df2 <-data.frame(c("257051033","267098003","267119002","267047006","267099020"))
names(df2)[1]<-"ID"
df2$vals <-c(11,22,33,44,55)
fetcher <-function(x){
y <- df2$vals[which(match(df2$ID,x)==TRUE)]
return(y)
}
sapply(df1$ID,function(x) fetcher(x))
The output from this sapply is the source of the problem.
> str(sapply(df1$ID,function(x) fetcher(x)))
List of 6
$ : num 33
$ : num 11
$ : num(0)
$ : num 22
$ : num 55
$ : num 44
I don't want this to be a list - I want a vector, and instead of num(0) I want NA (note that in this toy data it gives num(0); in my real data it gives integer(0)).
Here's a way to (a) replace integer(0) with NA and (b) transform the list into a vector.
# a regular data frame
> dat <- data.frame(x = 1:4)
# add a list including integer(0) as a column
> dat$col <- list(45,
+ 64,
+ integer(0),
+ 78)
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ x : int 1 2 3 4
$ col:List of 4
..$ : num 45
..$ : num 64
..$ : int
..$ : num 78
# find zero-length values
> idx <- !(sapply(dat$col, length))
# replace these values with NA
> dat$col[idx] <- NA
# transform list to vector
> dat$col <- unlist(dat$col)
# now the data frame contains vector columns only
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ x : int 1 2 3 4
$ col: num 45 64 NA 78
It is best to do that inside your function. I'll call it myFunctionForApply, but it stands for your current function. Before you return, check the length and, if it is 0, return NA:
myFunctionForApply <- function(x, ...) {
  # Do your processing
  # Let's say it ends up in variable 'ret':
  if (length(ret) == 0)
    return(NA)
  return(ret)
}
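Applied to the question's toy data, the length check could look like the sketch below. The function name fetch_one is hypothetical, and stringsAsFactors = FALSE is set explicitly so the IDs stay character. The last line shows an alternative that relies on match() returning NA for IDs it cannot find, which avoids the per-element function entirely:

```r
# Toy data from the question
df1 <- data.frame(ID = c("267119002", "257051033", NA, "267098003",
                         "267099020", "267047006"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("257051033", "267098003", "267119002",
                         "267047006", "267099020"),
                  vals = c(11, 22, 33, 44, 55),
                  stringsAsFactors = FALSE)

# Hypothetical helper: return NA instead of a zero-length result
fetch_one <- function(id) {
  hit <- df2$vals[which(df2$ID == id)]
  if (length(hit) == 0) NA_real_ else hit
}

# vapply() guarantees a numeric vector rather than a list
res_apply <- vapply(df1$ID, fetch_one, numeric(1), USE.NAMES = FALSE)
res_apply   # 33 11 NA 22 55 44

# Alternative: match() yields NA positions for missing IDs
res_match <- df2$vals[match(df1$ID, df2$ID)]
```

Both results are plain numeric vectors, so they can be assigned straight to a data frame column.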

daply: Correct results, but confusing structure

I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, even the most basic application of aaply(), like aaply(myarray,2,mean), returns nonsense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any kind of help is appreciated; I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set based on your str(). You can do what you want (I'm guessing) with two uses of ddply. First the means, then the difference of the means.
#Make dummy data
library(plyr) # for ddply()
mydf <- data.frame(subject = rep(1:5, each = 150),
                   congruent = rep(c(TRUE, FALSE), each = 75),
                   offset = rep(1:5, each = 15), RT = sample(300:500, 750, replace = TRUE))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000
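If a true numeric 3-d array is wanted, as in the question, one alternative is base R's tapply() instead of daply(). This is a swapped-in technique, not the answer's method, shown on the same kind of dummy data. A plausible reason for the list structure in the question is that summarize returns a data frame for every cell rather than a single number, but that is an assumption here.

```r
# Dummy data with the same layout as in the answer above
set.seed(42)
mydf <- data.frame(subject   = factor(rep(1:5, each = 150)),
                   congruent = rep(c(TRUE, FALSE), each = 75),
                   offset    = rep(1:5, each = 15),
                   RT        = sample(300:500, 750, replace = TRUE))

# Mean RT per subject x congruent x offset as a numeric 3-d array
means <- tapply(mydf$RT, mydf[c("subject", "congruent", "offset")], mean)
dim(means)         # 5 2 5
is.numeric(means)  # TRUE

# Congruent minus incongruent mean for each subject and offset
diff_means <- means[, "TRUE", ] - means[, "FALSE", ]
dim(diff_means)    # 5 5
```

The resulting array is numeric throughout, so functions such as apply() or aaply() can work on it directly.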

Why is the class and mode of the object returned by matrix() and array() the same?

Below are the first few rows of my large data file:
Symbol|Security Name|Market Category|Test Issue|Financial Status|Round Lot Size
AAC|Australia Acquisition Corp. - Ordinary Shares|S|N|D|100
AACC|Asset Acceptance Capital Corp. - Common Stock|Q|N|N|100
AACOU|Australia Acquisition Corp. - Unit|S|N|N|100
AACOW|Australia Acquisition Corp. - Warrant|S|N|N|100
AAIT|iShares MSCI All Country Asia Information Technology Index Fund|G|N|N|100
AAME|Atlantic American Corporation - Common Stock|G|N|N|100
I read the data in:
data <- read.table("nasdaqlisted.txt", sep="|", quote='', header=TRUE, as.is=TRUE)
and construct an array and a matrix:
d1 <- array(data, dim=c(nrow(data), ncol(data)))
d2 <- matrix(data, nrow=nrow(data), ncol=ncol(data))
However, even though d1 is an array and d2 is a matrix, the class and mode are the same:
> class(d1)
[1] "matrix"
> mode(d1)
[1] "list"
> class(d2)
[1] "matrix"
> mode(d2)
[1] "list"
Why is this?
I'll bite and have a go at explaining my understanding of the issues.
You don't need your large test file to demonstrate the issue. A simple data.frame would do:
test <- data.frame(var1=1:2,var2=letters[1:2])
> test
var1 var2
1 1 a
2 2 b
Keep in mind that a data.frame is just a list internally.
> is.data.frame(test)
[1] TRUE
> is.list(test)
[1] TRUE
With a list-like structure as you would expect.
> str(test)
'data.frame': 2 obs. of 2 variables:
$ var1: int 1 2
$ var2: Factor w/ 2 levels "a","b": 1 2
> str(as.list(test))
List of 2
$ var1: int [1:2] 1 2
$ var2: Factor w/ 2 levels "a","b": 1 2
When you call matrix() on a data.frame or a list, you end up with a matrix whose elements are the elements (columns) of the data.frame or list.
result1 <- matrix(test)
> result1
[,1]
[1,] Integer,2
[2,] factor,2
Looking at the structure of result1, you can see it is still a list, but now just with dimensions (see the last line in the output below).
> str(result1)
List of 2
$ : int [1:2] 1 2
$ : Factor w/ 2 levels "a","b": 1 2
- attr(*, "dim")= int [1:2] 2 1
Which means it is now both a matrix and a list:
> is.matrix(result1)
[1] TRUE
> is.list(result1)
[1] TRUE
If you strip the dimensions from this object, it will no longer be a matrix and will revert to just being a list.
dim(result1) <- NULL
> result1
[[1]]
[1] 1 2
[[2]]
[1] a b
Levels: a b
> is.matrix(result1)
[1] FALSE
> is.list(result1)
[1] TRUE
> str(result1)
List of 2
$ : int [1:2] 1 2
$ : Factor w/ 2 levels "a","b": 1 2
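The two observations from the question can be reproduced side by side: a two-dimensional array is simply a matrix, and calling matrix() on a data frame keeps the list mode. This is a small sketch reusing the answer's test data frame. The contrast with as.matrix() is an addition for illustration and not part of the original answer.

```r
test <- data.frame(var1 = 1:2, var2 = letters[1:2])

# An array with exactly two dimensions is a matrix
a <- array(1:6, dim = c(2, 3))
is.matrix(a)          # TRUE

# matrix() on a data frame: a list that merely carries a dim attribute
lm <- matrix(test)
mode(lm)              # "list"

# as.matrix() instead converts the columns to one atomic type
cm <- as.matrix(test)
mode(cm)              # "character"
dim(cm)               # 2 2
```

When an atomic matrix is the goal, as.matrix() is therefore usually the more suitable call for a data frame.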
