I have a table of counts in R:
Deg_Maj.tbl <- table(dat1$Maj,dat1$DegCat)
Deg_Maj.tbl
     B  D
  A 66  5
  C  2  9
and a table of percentages:
Deg_Maj.ptbl <- round(prop.table(Deg_Maj.tbl)*100,3)
Deg_Maj.ptbl
        B     D
  A 80.48  6.10
  C  2.44 10.98
I need to build a table that looks like this:
           B         D
A 66(80.48%)   5(6.1%)
C   2(2.44%) 9(10.98%)
I have several tables to do in this manner, so I am hoping to find a nice easy way to accomplish this.
We can use paste0:
Deg_Maj_out <- Deg_Maj.tbl
Deg_Maj_out[] <- paste0(Deg_Maj.tbl, "(", Deg_Maj.ptbl, "%)")
Deg_Maj_out
#          B         D
#A 66(80.48%)   5(6.1%)
#C   2(2.44%) 9(10.98%)
I want to do something similar to INDEX MATCH MATCH in Excel, but based on the column name in dfA and the row name in dfB.
A subset example: dfA is my data (usually imported from Excel) and dfB is always the same (molar weights):
dfA <- data.frame(Name = c("A", "B", "C", "D"),  # usually imported from Excel
                  Asp = c(2, 4, 6, 8),
                  Glu = c(1, 2, 3, 4),
                  Arg = c(5, 6, 7, 8))
> dfA
Name Asp Glu Arg
1 A 2 1 5
2 B 4 2 6
3 C 6 3 7
4 D 8 4 8
X <- c("Arg","Asp","Glu")
Y <- c(174,133,147)
dfB <- data.frame(X,Y)
> dfB
X Y
1 Arg 174
2 Asp 133
3 Glu 147
I would like to divide each number in dfA by the matching value in dfB, meaning R would "look up" the value in dfB whose X entry matches the column name in dfA and divide by it.
So, for example, take the value for sample A under column "Arg" (= 5) and divide it by the Y value of the "Arg" row in dfB (= 174):
5 / 174 = 0.029, and put the results in a new data frame called dfC, looking like below:
#How R would calculate:
Name Asp Glu Arg
1 A 2/133 1/147 5/174
2 B 4/133 2/147 6/174
3 C 6/133 3/147 7/174
4 D 8/133 4/147 8/174
> dfC
Name Asp Glu Arg
1 A 0.015 0.007 0.029
2 B 0.030 0.014 0.034
3 C 0.045 0.020 0.040
4 D 0.060 0.027 0.046
I hope this makes sense :) I am really stuck and have no clear idea how to do this easily. I can only think of some awkward workarounds that take much longer than Excel. But I would like to standardize it so I can reuse the R script every time I get data from the lab.
Here is a way: match the names of dfA (excluding the first) against the column dfB$X. Then divide each column of dfA[-1] by the corresponding element of dfB$Y. Finally, bind the result back with the Name column of dfA.
i <- match(names(dfA)[-1], dfB$X)
tmp <- mapply(\(x, y) x/y, dfA[-1], dfB$Y[i])
cbind(dfA[1], tmp)
#> Name Asp Glu Arg
#> 1 A 0.01503759 0.006802721 0.02873563
#> 2 B 0.03007519 0.013605442 0.03448276
#> 3 C 0.04511278 0.020408163 0.04022989
#> 4 D 0.06015038 0.027210884 0.04597701
Created on 2022-09-12 with reprex v2.0.2
Simpler, note the backticks:
tmp <- mapply(`/`, dfA[-1], dfB$Y[i])
Even simpler, do not create the temporary matrix at all:
cbind(dfA[1], mapply(`/`, dfA[-1], dfB$Y[i]))
I want to round these values, but they span several orders of magnitude, so I can't set one general rule like round(pvalue, 2). How do I accomplish this?
id <- LETTERS[1:10]
pvalue <- c(0.3,0.0432,0.0032,0.67,0.00000003,0.0069,0.782, 0.0004, 0.00076,0.341)
df <- data.frame(id,pvalue)
df
id pvalue
1 A 0.30000000
2 B 0.04320000
3 C 0.00320000
4 D 0.67000000
5 E 0.00000003
6 F 0.00690000
7 G 0.78200000
8 H 0.00040000
9 I 0.00076000
10 J 0.34100000
It should look like:
id pvalue
1 A 0.3
2 B 0.04
3 C 0.003
4 D 0.67
5 E <0.0001
6 F 0.007
7 G 0.78
8 H 0.0004
9 I 0.0007
10 J 0.34
I think you're using the wrong tool. If you want to prepare p-values for scientific display, you can use the function pvalString in the lazyWeave package to convert your numeric values into correctly formatted strings.
library(lazyWeave)
pvalue <- c(0.3,0.0432,0.0032,0.67,0.00000003,0.0069,0.782, 0.0004, 0.00076,0.341)
pvalString(pvalue)
You can edit the parameters to get exactly what you want but the default settings will give you the standard convention.
[1] "0.30" "0.043" "0.003" "0.67" "< 0.001" "0.007" "0.78" "< 0.001" "< 0.001" "0.34"
To take a step back, my ultimate goal is to read around 130,000 images into R, each with a pixel size of HxW, and then to make a dataframe/datatable containing the RGB values of each pixel of each image, one pixel per row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
# function to get rgb from image file paths
# (uses readJPEG from the jpeg package; H and W are the image height and
# width in pixels, assumed to be defined globally)
library(jpeg)

get_rgb_table <- function(link){
  img <- readJPEG(toString(link))
  # one row per pixel, one column per colour channel
  rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
                          g = as.vector(img[1:H, 1:W, 2]),
                          b = as.vector(img[1:H, 1:W, 3]))
  # add pixel id
  rgb_image$pixel_no <- row.names(rgb_image)
  # add image id (file name without the directory path)
  train_rgb <- cbind(sub('.*/', '', link), rgb_image)
  colnames(train_rgb)[1] <- "image_no"
  return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
train_data <- rbind(train_data,get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick; the problem is the loop and the rbind. I have tried using apply() but can't manage to apply it to each row and put the result in one dataframe without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x) randu[1:x, ]
And I have a data frame df that will act as input to the function; reconstructed from the printed output below, it is simply:
df <- data.frame(a = 1:10, b = 21:30)
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So I apply the function to each element of that column using lapply, then bind the results together using do.call and rbind, like this:
do.call(rbind, lapply(df$a, MyFun))
I am trying to solve the following inconvenience when exporting a table built from factor levels. Here is the code to generate the sample data, and a table from it:
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
library(Publish)
univariateTable(~data)
The default output of the univariateTable is by levels (From A through D):
Variable Levels Value
1 data A 30 (7.5)
2 B 120 (30.0)
3 C 180 (45.0)
4 D 70 (17.5)
How can I change this so that the output is ordered by value instead? I mean, the first row being the largest count (and percentage) and the last row being the lowest, like this:
Variable Levels Value
1 data C 180 (45.0)
2 B 120 (30.0)
3 D 70 (17.5)
4 A 30 (7.5)
Assuming that the "Publish" package is the one installed from GitHub, we extract the numbers before the "(" using sub, order them, and use that ordering to reorder the "xlevels" and "summary.totals" components.
#library(devtools)
#install_github("TagTeam/Publish")
library(Publish)
Out <- univariateTable(~data)
i1 <- order(as.numeric(sub('\\s+.*', '',
Out$summary.totals$data)), decreasing=TRUE)
Out$xlevels$data <- Out$xlevels$data[i1]
Out$summary.totals$data <- Out$summary.totals$data[i1]
Out
# Variable Level Total
#1 data C 180 (45.0)
#2 B 120 (30.0)
#3 D 70 (17.5)
#4 A 30 (7.5)
data
set.seed(24)
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
I'm having some trouble using factors in functions, or just making use of them in basic calculations. I have a data frame something like this (but with as many as 6000 different factor levels):
df <- data.frame(p = runif(20)*100,
                 q = sample(1:100, 20, replace = TRUE),
                 tt = c("e","e","f","f","f","i","h","e","i","i","f","f","j","j","h","h","h","e","j","i"),
                 ta = c("a","a","a","b","b","b","a","a","c","c","a","b","a","a","c","c","b","a","c","b"))
# note: this renames the third column to "ta" and the fourth to "tt",
# so the a/b/c labels end up in column tt (as in the output further down)
colnames(df) <- c("p","q","ta","tt")
Now price = p and quantity = q are my variables, and tt and ta are different factors.
I would first like to find the average price per unit of q for each level of tt:
sum(p*q) / sum(q), computed separately for each level of tt
This would in this case give me 3 different values, one each for a, b and c (I have 6000 different levels, so I need to do it smart :) ).
I have tried using split to make lists, and in this case I can get each tt level to contain the prices, and another list the quantities, but I can't seem to get them to, for example, make an average. I've also tried tapply, but again I can't see how to incorporate the factors into it.
EDIT: I can see I need to clarify:
I need to find 3 values, the average price per q given each level, so in this simplified case it would be:
a: sum of p*q over rows (1, 2, 3, 7, 11, 13, 14, 18) / sum of q over rows (1, 2, 3, 7, 11, 13, 14, 18)
So the result should be the average price for a, b and c, which is just 3 values.
I'd use plyr to do this:
library(plyr)
ddply(df, .(tt), mutate, new_col = (p*q) / sum(q))
p q ta tt new_col
1 73.92499 70 e a 11.29857879
2 58.49011 60 e a 7.66245932
3 17.23246 27 f a 1.01588711
4 64.74637 42 h a 5.93743967
5 55.89372 45 e a 5.49174103
6 25.87318 83 f a 4.68880732
7 12.35469 23 j a 0.62043207
8 1.19060 83 j a 0.21576367
9 84.18467 25 e a 4.59523322
10 73.59459 66 f b 10.07726727
11 26.12099 99 f b 5.36509998
12 25.63809 80 i b 4.25528535
13 54.74334 90 f b 10.22178577
14 69.45430 50 h b 7.20480246
15 52.71006 97 i b 10.60762667
16 17.78591 54 i c 5.16365066
17 0.15036 41 i c 0.03314388
18 85.57796 30 h c 13.80289670
19 54.38938 44 h c 12.86630433
20 44.50439 17 j c 4.06760541
plyr does have a reputation for being slow; data.table provides similar functionality with much higher performance.
If I understood your problem correctly, this should be the answer. Give it a try and respond, so that I can adjust it if needed.
myRes <- function(tt) {
  out <- NULL
  for (var in tt) {
    index <- which(df[["tt"]] == var)
    # average price per unit of q within the group: sum(p*q) / sum(q)
    out <- c(out, sum(df[index, "p"] * df[index, "q"]) / sum(df[index, "q"]))
  }
  names(out) <- tt
  return(out)
}

threeValue <- myRes(unique(df[["tt"]]))