Forgive my lack of R knowledge. I am running some statistics, but I have a problem with the number of decimals in the output. The table I use is simple, including two columns of text and two numeric columns. The table shows 5 digits (3 decimals). However, when working with this table, RStudio only gives 1 decimal, not only in my lsmeans results but already in head(X).
I already tried the following (where X is data):
>format(X, digits=5)
>format(X, decimals=3)
>print(lsmeans,decimals=3)
>options(digits = 5)
However the columns N and Dm are still rounded to 1 decimal.
> N 92.4 92.4 93.7 .....
> Dm 44.8 51.2 49.0 ....
> lsmean 92.7 93.3 92.2
I would like to see the columns N and Dm with 3 decimals (as I see them in the table when I use View(X)), and likewise the lsmean results for N.
Example data:
X <- structure(list(Diet = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("1",
"2", "3", "4"),
class = c("ordered", "factor")),
Room = structure(c(1L,
1L, 1L, 1L, 1L, 2L),
.Label = c("1", "2"), class = c("ordered",
"factor")),
Ndigestibility = c(92.3961026914675, 91.3131265857907,
93.7094576131358, 93.1557358031795,
91.6853770290382, 93.2698082975574),
Dmdigestibility = c(44.7692224966736, 51.2173172537712,
49.0100980168149, 45.6289084300095,
45.9036710781654, 45.3144774487225)),
row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
X
# Diet Room Ndigestibility Dmdigestibility
# 1 1 1 92.39610 44.76922
# 2 1 1 91.31313 51.21732
# 3 1 1 93.70946 49.01010
# 4 1 1 93.15574 45.62891
# 5 1 1 91.68538 45.90367
# 6 1 2 93.26981 45.31448
You can do
emm_options(opt.digits = FALSE)
This will disable the feature in the emmeans package (for which lsmeans is a front end) whereby results are displayed in reasonable precision relative to their standard errors.
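The rounding in head(X) has a separate cause from the lsmeans output. The example data has class tbl_df, and tibbles print numeric columns with about three significant digits by default (if the pillar package is doing the printing, its pillar.sigfig option controls this). The following is a minimal base R sketch, assuming only the first two values of each numeric column from the example, of forcing three decimals explicitly when formatting. The names three_dp and X_fmt are made up for illustration.

```r
# Two rows of the example data, stored as a plain data.frame (assumption:
# the tibble class is dropped so base R printing rules apply).
X <- data.frame(
  Ndigestibility  = c(92.3961026914675, 91.3131265857907),
  Dmdigestibility = c(44.7692224966736, 51.2173172537712)
)

# Round to 3 decimals, then keep trailing zeros with nsmall.
three_dp <- format(round(X$Ndigestibility, 3), nsmall = 3)
print(three_dp)

# Format every numeric column as text with exactly 3 decimals.
X_fmt <- data.frame(lapply(X, function(col) sprintf("%.3f", col)))
print(X_fmt)
```

Note that format() and sprintf() return character vectors, so this is only for display. Keep the original numeric columns for any further calculation.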
I have a data.frame object in R and need to:
Group by col_1
Select rows from col_3 such that the col_2 value is the second largest one (if there is only one observation for the given value of col_1, return NA, for instance).
How can I obtain this?
Example:
scored xg first_goal scored_mane
1 1 1.03212 Lallana 0
2 1 2.06000 Mane 1
3 2 2.38824 Robertson 1
4 2 1.64291 Mane 1
Group by "scored_mane", return values from "scored" where "xg" is the second largest. Expected output: "NA", 1
You can try the following base R solution, using aggregate + merge
res <- merge(aggregate(xg ~ scored_mane, df, function(v) sort(v, decreasing = TRUE)[2]), df, all.x = TRUE)[, "scored"]
such that
> res
[1] NA 1
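An alternative that avoids the merge step is to split the data by group and pick the row directly. This is only a sketch, not the answer's method: it assumes the same df as in the question, and second_largest_scored is a hypothetical helper name.

```r
df <- data.frame(
  scored      = c(1L, 1L, 2L, 2L),
  xg          = c(1.03212, 2.06, 2.38824, 1.64291),
  first_goal  = c("Lallana", "Mane", "Robertson", "Mane"),
  scored_mane = c(0L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one group, return `scored` from the row whose
# `xg` is the second largest, or NA if the group has fewer than two rows.
second_largest_scored <- function(d) {
  if (nrow(d) < 2) return(NA_integer_)
  d$scored[order(d$xg, decreasing = TRUE)[2]]
}

# One value per level of scored_mane, named by the group value.
res <- vapply(split(df, df$scored_mane), second_largest_scored, integer(1))
print(res)
```

Unlike the merge version, the result is a named vector, so each value can be traced back to its group.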
DATA
structure(list(scored = c(1L, 1L, 2L, 2L), xg = c(1.03212, 2.06,
2.38824, 1.64291), first_goal = c("Lallana", "Mane", "Robertson",
"Mane"), scored_mane = c(0L, 1L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4")) -> df
I have a data frame with a column named Rooms, which holds the number of rooms in the house. It has 50,000+ rows; I checked it with str(df$Rooms) and it is a factor with 44 levels. The column looks like this:
>str(df$Rooms)
Factor w/ 44 levels "","1","1+1","1+2",..: 20 32 23 27 28 29 27 23 26 24 ...
> df$Rooms
1+2
3
1+3
1+2
4
3
1+1
2
..
..
My question: is there any function or library in R that can be used to evaluate these expressions, so that the column becomes something like this:
> df$Rooms
3
3
4
3
4
3
2
2
..
..
Thank you in advance~
We can use eval(parse(text = ...)):
df$final_rooms <- sapply(as.character(df$Rooms), function(x) eval(parse(text = x)))
df
# Rooms final_rooms
#1 1+2 3
#2 3 3
#3 1+3 4
#4 1+2 3
#5 4 4
#6 3 3
#7 1+1 2
#8 2 2
data
df <- structure(list(Rooms = structure(c(2L, 5L, 3L, 2L, 6L, 5L, 1L,
4L), .Label = c("1+1", "1+2", "1+3", "2", "3", "4"), class = "factor")),
class = "data.frame", row.names = c(NA, -8L))
In base R, we can split on the + and take the sum after converting to numeric, without using eval(parse(...)):
df$final_rooms <- sapply(strsplit(as.character(df$Rooms) , "+",
fixed = TRUE), function(x) sum(as.numeric(x)))
Another option is to read the strings with read.table into separate columns and then use the vectorized rowSums:
df$final_rooms <- rowSums(read.table(text = as.character(df$Rooms),
sep="+", header = FALSE, fill = TRUE), na.rm = TRUE)
df$final_rooms
#[1] 3 3 4 3 4 3 2 2
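The str() output in the question shows an empty string among the 44 levels, which the small example data does not contain. With the strsplit approach an empty string splits into a zero-length vector, whose sum is 0. The sketch below, which assumes a made-up four-element rooms vector, returns NA for such entries instead.

```r
# Illustrative input (assumption): includes the empty level from the question.
rooms <- factor(c("1+2", "3", "", "1+1"))

# Split each entry on a literal "+".
parts <- strsplit(as.character(rooms), "+", fixed = TRUE)

# Sum the pieces; an empty entry yields a zero-length vector, mapped to NA.
total <- vapply(
  parts,
  function(p) if (length(p) == 0) NA_real_ else sum(as.numeric(p)),
  numeric(1)
)
print(total)
```

Whether an empty entry should count as NA or 0 depends on what the blank means in the data.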
I am interested in testing some network visualization techniques but before trying those functions I want to build an adjacency matrix (from, to) using the dataframe which is as follows.
Id  Gender  Col_Cold_1  Col_Cold_2  Col_Cold_3  Col_Hot_1  Col_Hot_2   Col_Hot_3
10  F       pain        sleep       NA          infection  medication  walking
14  F       Bump        NA          muscle      NA         twitching   flutter
17  M                   pain        hemaloma    Callus     infection
18  F       muscle                  pain                   twitching   medication
My goal is to create an adjacency matrix as follows
1) All values in columns with keyword Cold will contribute to the rows
2) All values in columns with keyword Hot will contribute to the columns
For example, pain, sleep, Bump, muscle and hemaloma are cell values under the columns with the keyword Cold, so they will form the rows. Cell values such as infection, medication, Callus, walking, twitching and flutter are under the columns with the keyword Hot, so they will form the columns of the adjacency matrix.
The final desired output should appear like this:
infection medication walking twitching flutter Callus
pain 2 2 1 1 1
sleep 1 1 1
Bump 1 1
muscle 1 1
hemaloma 1 1
[pain, infection] = 2 because the association between pain and infection occurs twice in the original dataframe: once in row 1 and again in row 3.
[pain, medication] = 2 because the association between pain and medication occurs twice: once in row 1 and again in row 4.
Any suggestions or advice on producing such an association matrix is much appreciated thanks.
Reproducible Dataset
df = structure(list(id = c(10, 14, 17, 18), Gender = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), Col_Cold_1 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "Bump", "muscle", "pain"), class = "factor"), Col_Cold_2 = structure(c(4L, 2L, 3L, 1L), .Label = c("", "NA", "pain", "sleep"), class = "factor"), Col_Cold_3 = structure(c(1L, 3L, 2L, 4L), .Label = c("NA", "hemaloma", "muscle", "pain" ), class = "factor"), Col_Hot_1 = structure(c(4L, 3L, 2L, 1L), .Label = c("", "Callus", "NA", "infection"), class = "factor"), Col_Hot_2 = structure(c(2L, 3L, 1L, 3L), .Label = c("infection", "medication", "twitching"), class = "factor"), Col_Hot_3 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "flutter", "medication", "walking" ), class = "factor")), .Names = c("id", "Gender", "Col_Cold_1", "Col_Cold_2", "Col_Cold_3", "Col_Hot_1", "Col_Hot_2", "Col_Hot_3" ), row.names = c(NA, -4L), class = "data.frame")
One way is to make the dataset into a "tidy" form, then use xtabs. First, some cleaning up:
df[] <- lapply(df, as.character) # Convert factors to characters
df[df == "NA" | df == "" | is.na(df)] <- NA # Make all blanks NAs
Now, tidy the dataset:
library(tidyr)
library(dplyr)
out <- do.call(rbind, sapply(grep("^Col_Cold", names(df), value = TRUE), function(x){
  vars <- c(x, grep("^Col_Hot", names(df), value = TRUE))
  setNames(gather_(select(df, one_of(vars)),
                   key_col = x,
                   value_col = "value",
                   gather_cols = vars[-1])[, c(1, 3)], c("cold", "hot"))
}, simplify = FALSE))
The idea is to "pair" each of the "cold" columns with each of the "hot" columns to make a long dataset. out looks like this:
out
# cold hot
# 1 pain infection
# 2 Bump <NA>
# 3 <NA> Callus
# 4 muscle <NA>
# 5 pain medication
# ...
Finally, use xtabs to make the desired output:
xtabs(~ cold + hot, na.omit(out))
# hot
# cold Callus flutter infection medication twitching walking
# Bump 0 1 0 0 1 0
# hemaloma 1 0 1 0 0 0
# muscle 0 1 0 1 2 0
# pain 1 0 2 2 1 1
# sleep 0 0 1 1 0 1
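The same pairing idea can be sketched without tidyr or dplyr, which may help where gather_ is no longer available. This is an assumption-laden illustration rather than the answer's code: the data frame is retyped by hand from the question with blanks and "NA" strings already converted to NA, and cold_cols, hot_cols, pairs and tab are illustrative names.

```r
# Cleaned version of the question's data (assumption: blanks are already NA).
df <- data.frame(
  Col_Cold_1 = c("pain", "Bump", NA, "muscle"),
  Col_Cold_2 = c("sleep", NA, "pain", NA),
  Col_Cold_3 = c(NA, "muscle", "hemaloma", "pain"),
  Col_Hot_1  = c("infection", NA, "Callus", NA),
  Col_Hot_2  = c("medication", "twitching", "infection", "twitching"),
  Col_Hot_3  = c("walking", "flutter", NA, "medication"),
  stringsAsFactors = FALSE
)

cold_cols <- grep("^Col_Cold", names(df), value = TRUE)
hot_cols  <- grep("^Col_Hot",  names(df), value = TRUE)

# For each row, form every (cold, hot) combination, then stack the rows.
pairs <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  expand.grid(cold = unlist(df[i, cold_cols]),
              hot  = unlist(df[i, hot_cols]),
              stringsAsFactors = FALSE)
}))

# Drop pairs with a missing side and cross-tabulate.
tab <- xtabs(~ cold + hot, data = na.omit(pairs))
print(tab)
```

Individual cells can then be read by name, for example tab["pain", "infection"].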
I have a data.frame 'data', where one column contains integer values between 1:100, which are coded values for the Isolate they represent.
Here's my example data, 'data':
Size Isolate spin
1 primary 3 up
2 primary 4 down
3 sec 6 strange
4 ter 1 charm
5 sec 3 bottom
6 quart 2 top
I have another data.frame that contains the key between the integers and the name of the Isolate
1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf
This list is 100 Isolates in length, too much to type in by hand with if/else.
I'd like to know an easy way to replace the integers in my first data.frame, which aren't in ascending order as you can see, with the corresponding Isolate names from the second data.frame.
I tried, after researching:
data$Isolate <- as.numeric(factor(data$Isolate,
levels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf")
)
)
but this just replaced the Isolate column with NA.
As Hubert said in the comments, this is a simple use-case for merge.
Let's say the column names of your second "key" data frame are "Isolate" and "Isolate_Name", then it's as easy as
merge(data, key_data, by = "Isolate")
The default is for an "inner join" which will only keep records that have matches. If you're worried about losing records that don't have matches you can add the argument all.x = TRUE.
If you prefer non-base packages, this is easy in data.table or dplyr as well.
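To make the difference between the default inner join and all.x = TRUE concrete, here is a small sketch. The column names Isolate and Isolate_Name follow the assumption stated above, and the code 9 is deliberately chosen to have no match in the key; the data values are otherwise invented for illustration.

```r
# Illustrative data (assumption): Isolate 9 has no entry in the key.
data <- data.frame(Size = c("primary", "primary", "sec"),
                   Isolate = c(3L, 4L, 9L),
                   stringsAsFactors = FALSE)
key_data <- data.frame(Isolate = 1:7,
                       Isolate_Name = c("alpha", "bravo", "charlie", "delta",
                                        "echo", "foxtrot", "golf"),
                       stringsAsFactors = FALSE)

# Inner join: the unmatched row (Isolate 9) is dropped.
inner <- merge(data, key_data, by = "Isolate")

# Left join: every row of data is kept; unmatched names become NA.
left <- merge(data, key_data, by = "Isolate", all.x = TRUE)

print(inner)
print(left)
```

Keep in mind that merge sorts the result by the key column by default, so the row order of data is not preserved.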
Using factor, you could try:
data$Isolate <- factor(data$Isolate,
levels=1:7,
labels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf"))
If you have many levels that are already in their own data.frame, you could automate this.
data$Isolate <- factor(data$Isolate,levels=code$No,labels=code$Value)
With your second data.frame, code:
code <- read.table(text="1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf",stringsAsFactors=FALSE)
names(code) <- c("No","Value")
df$Isolate <- df2[,1][df$Isolate]
# Size Isolate spin
# 1 primary charlie up
# 2 primary delta down
# 3 sec foxtrot strange
# 4 ter alpha charm
# 5 sec charlie bottom
# 6 quart bravo top
You can subset the lookup data frame by the target data frame.
Data
df <- structure(list(Size = structure(c(1L, 1L, 3L, 4L, 3L, 2L), .Label = c("primary",
"quart", "sec", "ter"), class = "factor"), Isolate = c(3L, 4L,
6L, 1L, 3L, 2L), spin = structure(c(6L, 3L, 4L, 2L, 1L, 5L), .Label = c("bottom",
"charm", "down", "strange", "top", "up"), class = "factor")), .Names = c("Size",
"Isolate", "spin"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(V2 = structure(1:7, .Label = c("alpha", "bravo",
"charlie", "delta", "echo", "foxtrot", "golf"), class = "factor")), .Names = "V2", class = "data.frame", row.names = c(NA,
-7L))
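Indexing the lookup column directly, as above, relies on the codes being exactly 1, 2, 3, ... in row order. If the key has an explicit code column, match() makes the lookup independent of row order. A minimal sketch, assuming a key data frame with hypothetical columns No and Value:

```r
data <- data.frame(Size = c("primary", "primary", "sec", "ter", "sec", "quart"),
                   Isolate = c(3L, 4L, 6L, 1L, 3L, 2L),
                   spin = c("up", "down", "strange", "charm", "bottom", "top"),
                   stringsAsFactors = FALSE)

# Hypothetical key: codes in No, names in Value.
key <- data.frame(No = 1:7,
                  Value = c("alpha", "bravo", "charlie", "delta",
                            "echo", "foxtrot", "golf"),
                  stringsAsFactors = FALSE)

# match() gives, for each code in data, the row of key holding that code;
# codes missing from the key would become NA.
data$Isolate <- key$Value[match(data$Isolate, key$No)]
print(data)
```

This keeps the original row order of data, which merge does not.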
DATA AND REQUIREMENTS
The first table (myMatrix1) is from an old geological survey that used different region boundaries (begin and finish) columns to the newer survey.
What I wish to do is to match the begin and finish boundaries and then create two tables: one for the new data on sedimentation, and one for the new data on bore width characterised as a boolean.
myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")
> head(myMatrix1) # this is the old data
sampleIDs begin finish
1 19990224 4 5
2 20000224 5 6
3 20010203 6 8
4 20019024 29 30
5 20020201 51 52
> head(myMatrix2) # this is the new data
begin finish sedimentation boreWidth
1 0 10 1.002455 0.014354
2 11 367 2.094351 0.056431
3 368 920 0.450275 0.154105
4 921 1414 2.250820 1.004353
5 1415 5278 0.114109 NA
Desired output:
> head(myMatrix6)
sampleIDs begin finish sedimentation #myMatrix4
1 19990224 4 5 1.002455
2 20000224 5 6 1.002455
3 20010203 6 8 2.094351
4 20019024 29 30 2.094351
5 20020201 51 52 2.094351
> head(myMatrix7)
sampleIDs begin finish boreWidthThresh #myMatrix5
1 19990224 4 5 FALSE
2 20000224 5 6 FALSE
3 20010203 6 8 FALSE
4 20019024 29 30 FALSE
5 20020201 51 52 FALSE
CODE
The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
  # select the rows of myMatrix1 whose begin value lies strictly
  # between myMatrix2$begin[i] and myMatrix2$finish[i]
  myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
  # repeat the sedimentation value of interval i once per matched row
  myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))
  if (is.na(myMatrix2$boreWidth[i])) {
    myMatrix5 <- rep(NA, nrow(myMatrix3))
  }
  else if (myMatrix2$boreWidth[i] == 0) {
    myMatrix5 <- rep(TRUE, nrow(myMatrix3))
  }
  else if (myMatrix2$boreWidth[i] > 0) {
    myMatrix5 <- rep(FALSE, nrow(myMatrix3))
  }
  myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
  myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
EDIT:
> dput(head(myMatrix2))
structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0",
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L,
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278",
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L,
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455",
"2.094351", "2.250820", "sedimentation"), class = "factor"),
V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354",
"0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(myMatrix1))
structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224",
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"),
V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29",
"4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L,
2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8",
"finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")
First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656
The first thing that jumps out at me is that I'd create only one results matrix. That way you're not duplicating the sampleIDs begin finish columns, and you can avoid any overhead that comes with running the matching algorithm twice.
Doing that, you can avoid selecting more than once (although it's trivial in terms of speed as long as you store your selection vector rather than re-calculate).
Here's a solution using apply:
myMatrix1 <- data.frame(sampleIDs=c(19990224,20000224),begin=c(4,5),finish=c(5,6))
myMatrix2 <- data.frame(begin=c(0,11),finish=c(10,367),sed=c(1.002,2.01),boreWidth=c(.014,.056))
glommer <- function(x,myMatrix2) {
  x[4:5] <- as.numeric(myMatrix2[ myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"],
                                  c("sed","boreWidth") ])
  names(x)[4:5] <- c("sed","boreWidth")
  return( x )
}
> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
sampleIDs begin finish sed boreWidth
[1,] 19990224 4 5 1.002 0.014
[2,] 20000224 5 6 1.002 0.014
I used apply and stored everything as numeric. Other approaches would be to return a data.frame and have the sampleIDs and begin, finish be ints. That might avoid some problems with floating point error.
This solution assumes there are no boundary cases (e.g. the begin, finish times of myMatrix1 are entirely contained within the begin, finish times of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.
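For millions of rows, a fully vectorized lookup may be considerably faster than apply, because the interval for every sample is found in one call. The sketch below uses findInterval and is not part of the approach above. It assumes the intervals in myMatrix2 are sorted by begin and do not overlap, and it uses the same inclusive containment rule as glommer(). The names idx, contained and result are illustrative, and the data is retyped from the question's head() output.

```r
myMatrix1 <- data.frame(
  sampleIDs = c(19990224, 20000224, 20010203, 20019024, 20020201),
  begin     = c(4, 5, 6, 29, 51),
  finish    = c(5, 6, 8, 30, 52)
)
myMatrix2 <- data.frame(
  begin         = c(0, 11, 368, 921, 1415),
  finish        = c(10, 367, 920, 1414, 5278),
  sedimentation = c(1.002455, 2.094351, 0.450275, 2.250820, 0.114109),
  boreWidth     = c(0.014354, 0.056431, 0.154105, 1.004353, NA)
)

# Index of the last interval whose begin is <= each sample's begin
# (assumption: myMatrix2$begin is sorted in increasing order).
idx <- findInterval(myMatrix1$begin, myMatrix2$begin)
idx[idx == 0] <- NA  # sample starts before the first interval

# Keep a match only if the sample also ends inside that interval.
contained <- !is.na(idx) & myMatrix1$finish <= myMatrix2$finish[idx]
idx[!contained] <- NA

# One result table with both new columns; boreWidthThresh is TRUE when
# boreWidth is exactly 0, FALSE when it is positive, and NA when missing.
result <- cbind(
  myMatrix1,
  sedimentation   = myMatrix2$sedimentation[idx],
  boreWidthThresh = myMatrix2$boreWidth[idx] == 0
)
print(result)
```

Samples that straddle two intervals get NA here. As with glommer(), how such boundary cases should be handled is a substantive decision about the data.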