gsub() on header in a data frame

I have a file that looks like
X90045GridMs.TotPFPrc X90045Inv.TmpLimStt X90042InvCtl.Stt X90042Mode
1 NA NA NA NA
2 0.00 1 3 7
3 0.44 1 2 1
4 0.80 1 2 1
5 0.88 1 2 1
6 0.93 1 2 1
7 0.95 1 2 1
8 0.98 1 2 1
9 0.99 1 2 1
where the headers are made up of a serial number and a parameter name. I would like to change the X90045 and X90042 prefixes in the headers to Inv1 and Inv2 using gsub. Is there a way to apply gsub to the header? The end result should look something like this:
Inv1GridMs.TotPFPrc Inv1Inv.TmpLimStt Inv2InvCtl.Stt Inv2Mode
1 NA NA NA NA
2 0.00 1 3 7
3 0.44 1 2 1
4 0.80 1 2 1
5 0.88 1 2 1
6 0.93 1 2 1
7 0.95 1 2 1
8 0.98 1 2 1
9 0.99 1 2 1

Is your data in a data.frame object? If so, you can access and modify the header using names():
names(yourdata) <- gsub("X90045", "Inv1", names(yourdata))
and likewise for the other serial number.
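If you have several serial numbers to rename, the same idea generalises; a small sketch, assuming your data frame is called yourdata as above (the replacements lookup vector is just illustrative):
replacements <- c(X90045 = "Inv1", X90042 = "Inv2")
for (serial in names(replacements)) {
  # fixed = TRUE treats each serial as a literal string, not a regex
  names(yourdata) <- gsub(serial, replacements[[serial]], names(yourdata), fixed = TRUE)
}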

Replace special strings from all columns of a data.frame in R using dplyr

My data frame looks like this
value <- c(0,0.1,0.2,0.4,0,"0.05,",0.05,0.5,0.20,0.40,0.50,0.60)
time <- c(1,1,"1,",1,2,2,2,2,3,3,3,3)
ID <- c("1,","2,","3,",4,1,2,3,4,1,2,3,4)
test <- data.frame(value, time, ID)
test
value time ID
1 0 1 1,
2 0.1 1 2,
3 0.2 1, 3,
4 0.4 1 4
5 0 2 1
6 0.05, 2 2
7 0.05 2 3
8 0.5 2 4
9 0.2 3 1
10 0.4 3 2
11 0.5 3 3
12 0.6 3 4
I want to replace the "," in all columns with "", but I keep getting an error:
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
I would like my data to look like this
value time ID
1 0.00 1 1
2 0.10 1 2
3 0.20 1 3
4 0.40 1 4
5 0.00 2 1
6 0.05 2 2
7 0.05 2 3
8 0.50 2 4
9 0.20 3 1
10 0.40 3 2
11 0.50 3 3
12 0.60 3 4
EDIT
test %>%
  mutate_all(~ gsub(",", "", .))
The easiest in this case might be to use parse_number from the readr package, e.g.:
apply(test, 2, readr::parse_number)
or in dplyr lingo:
test %>% mutate_all(readr::parse_number)
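On current dplyr (1.0+), mutate_all is superseded by across; an equivalent sketch on the same test data frame:
test %>% mutate(across(everything(), readr::parse_number))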
A simple base R solution:
test <- sapply(test, function(x) as.numeric(sub(",", "", x)))
test
value time ID
[1,] 0.00 1 1
[2,] 0.10 1 2
[3,] 0.20 1 3
[4,] 0.40 1 4
[5,] 0.00 2 1
[6,] 0.05 2 2
[7,] 0.05 2 3
[8,] 0.50 2 4
[9,] 0.20 3 1
[10,] 0.40 3 2
[11,] 0.50 3 3
[12,] 0.60 3 4
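Note that sapply returns a matrix here (hence the [1,]-style row labels above). If you would rather keep test as a data frame, a sketch of the same fix using lapply:
# assigning to test[] replaces the columns in place and preserves the data.frame class
test[] <- lapply(test, function(x) as.numeric(sub(",", "", x)))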
test %>%
  mutate_at(vars(value, time, ID), ~ gsub(".*?(-?[0-9]+\\.?[0-9]*).*", "\\1", .))
# value time ID
# 1 0 1 1
# 2 0.1 1 2
# 3 0.2 1 3
# 4 0.4 1 4
# 5 0 2 1
# 6 0.05 2 2
# 7 0.05 2 3
# 8 0.5 2 4
# 9 0.2 3 1
# 10 0.4 3 2
# 11 0.5 3 3
# 12 0.6 3 4
The further we get into "let's try to parse anything that could be a number" territory, the crazier it can get, scientific notation included. For that, readr::parse_number, already suggested, is likely the better candidate if you can accept one more package dependency.
However, seeing data like this suggests that either the import method or the process that produces the data has mistakes in it. While this patch works around those kinds of mistakes, it is far better to fix whatever error is causing them.

Taking means over `sam` and `dup`

I am trying to take means over the columns sam and dup (i.e. one mean of fat for each lab and co combination) in the following dataset:
fat co lab sam dup
1 0.62 1 1 1 1
2 0.55 1 1 1 2
3 0.34 1 1 2 1
4 0.24 1 1 2 2
5 0.80 1 1 3 1
6 0.68 1 1 3 2
7 0.76 1 1 4 1
8 0.65 1 1 4 2
9 0.30 1 2 1 1
10 0.40 1 2 1 2
11 0.33 1 2 2 1
12 0.43 1 2 2 2
13 0.39 1 2 3 1
14 0.40 1 2 3 2
15 0.29 1 2 4 1
16 0.18 1 2 4 2
17 0.46 1 3 1 1
18 0.38 1 3 1 2
19 0.27 1 3 2 1
20 0.37 1 3 2 2
21 0.37 1 3 3 1
22 0.42 1 3 3 2
23 0.45 1 3 4 1
24 0.54 1 3 4 2
25 0.18 2 1 1 1
26 0.47 2 1 1 2
27 0.53 2 1 2 1
28 0.32 2 1 2 2
29 0.40 2 1 3 1
30 0.37 2 1 3 2
31 0.31 2 1 4 1
32 0.43 2 1 4 2
33 0.35 2 2 1 1
34 0.39 2 2 1 2
35 0.37 2 2 2 1
36 0.33 2 2 2 2
37 0.42 2 2 3 1
38 0.36 2 2 3 2
39 0.20 2 2 4 1
40 0.41 2 2 4 2
41 0.37 2 3 1 1
42 0.43 2 3 1 2
43 0.28 2 3 2 1
44 0.36 2 3 2 2
45 0.18 2 3 3 1
46 0.20 2 3 3 2
47 0.26 2 3 4 1
48 0.06 2 3 4 2
The output should be this:
lab co fat
1 1 1 0.58000
2 2 1 0.34000
3 3 1 0.40750
4 1 2 0.37625
5 2 2 0.35375
6 3 2 0.26750
These are both in the form of .RData files.
How can this be done?
An example with part of the data you posted:
dt = read.table(text = "
fat co lab sam dup
0.62 1 1 1 1
0.55 1 1 1 2
0.34 1 1 2 1
0.24 1 1 2 2
0.80 1 1 3 1
0.68 1 1 3 2
0.76 1 1 4 1
0.65 1 1 4 2
0.30 1 2 1 1
0.40 1 2 1 2
0.33 1 2 2 1
0.43 1 2 2 2
0.39 1 2 3 1
0.40 1 2 3 2
0.29 1 2 4 1
0.18 1 2 4 2
", header= T)
library(dplyr)
dt %>%
  group_by(lab, co) %>%            # for each lab and co combination
  summarise(fat = mean(fat)) %>%   # get the mean of fat
  ungroup()                        # forget the grouping
# # A tibble: 2 x 3
# lab co fat
# <int> <int> <dbl>
# 1 1 1 0.58
# 2 2 1 0.34
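If you prefer base R over dplyr, a sketch computing the same means with aggregate on the same dt:
aggregate(fat ~ lab + co, data = dt, FUN = mean)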

How can I calculate and update Bayesian discrete probabilities?

I have 100 rows and 5 columns; each row is a sequence of 5 integers between 1 and 3, like:
X1 X2 X3 X4 X5
1 2 2 1 3 1
2 2 1 2 3 2
3 2 3 1 2 2
4 2 3 1 1 3
5 2 1 2 3 1
6 2 1 2 2 2
7 2 2 2 1 2
8 2 1 3 2 2
9 2 2 2 2 2
10 2 2 1 2 3
11 2 3 2 1 1
12 2 1 3 2 1
13 2 1 3 2 3
14 2 2 2 3 3
15 2 3 2 2 1
16 2 1 2 2 3
17 2 3 3 1 2
18 2 1 3 1 3
.....
How can I calculate the probability of a row's whole sequence given just its first integer, then update that probability as I look at the second, the third, and so on?
More info:
There is no dependency between rows or between columns, but I have the probability of each integer (1, 2, 3) appearing in each column: 1 (0.2), 2 (0.2), 3 (0.6). The probability of each unique series in a row is:
11223 0.01
12311 0.01
13121 0.01
13233 0.01
13313 0.01
13323 0.01
.........
33233 0.02
33313 0.01
33323 0.01
33331 0.01
33332 0.03
33333 0.16
Each draw is independent of the previous draw.
Instead of the integers 1, 2, 3 I should perhaps talk about A, B, and C to avoid confusion.
This is a categorical Bayesian probability (I think :-) ).
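Since the draws are independent, the probability of a full row is just the product of the per-column probabilities, and "updating" as each integer is revealed amounts to taking a running product. A minimal sketch, assuming the stated probabilities 0.2, 0.2, 0.6 hold in every column:
p <- c("1" = 0.2, "2" = 0.2, "3" = 0.6)  # stated probability of each integer
row1 <- c(2, 2, 1, 3, 1)                 # first row of the example data
cumprod(p[as.character(row1)])           # running probability after each reveal
Each intermediate element of the cumulative product is the updated probability after observing one more integer; the last element is the probability of the whole row.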

How do I sort one vector based on values of another (with data.frame)

I have a data frame true_set that I would like to sort based on the order of values in the rows of the matrix orders.
true_set <- data.frame(dose1=c(rep(1,5),rep(2,5),rep(3,5)), dose2=c(rep(1:5,3)),toxicity=c(0.05,0.1,0.15,0.3,0.45,0.1,0.15,0.3,0.45,0.55,0.15,0.3,0.45,0.55,0.6),efficacy=c(0.2,0.3,0.4,0.5,0.6,0.4,0.5,0.6,0.7,0.8,0.5,0.6,0.7,0.8,0.9),d=c(1:15))
orders<-matrix(nrow=3,ncol=15)
orders[1,]<-c(1,2,6,3,7,11,4,8,12,5,9,13,10,14,15)
orders[2,]<-c(1,6,2,3,7,11,12,8,4,5,9,13,14,10,15)
orders[3,]<-c(1,6,2,11,7,3,12,8,4,13,9,5,14,10,15)
The expected result would be:
First, orders[1,]:
dose1 dose2 toxicity efficacy d
1 1 1 0.05 0.2 1
2 1 2 0.10 0.3 2
3 2 1 0.10 0.4 6
4 1 3 0.15 0.4 3
5 2 2 0.15 0.5 7
6 3 1 0.15 0.5 11
7 1 4 0.30 0.5 4
8 2 3 0.30 0.6 8
9 3 2 0.30 0.6 12
10 1 5 0.45 0.6 5
11 2 4 0.45 0.7 9
12 3 3 0.45 0.7 13
13 2 5 0.55 0.8 10
14 3 4 0.55 0.8 14
15 3 5 0.60 0.9 15
Then orders[2,] and orders[3,]: the same rows, reordered accordingly.
true_set <- data.frame(dose1=c(rep(1,5),rep(2,5),rep(3,5)), dose2=c(rep(1:5,3)),toxicity=c(0.05,0.1,0.15,0.3,0.45,0.1,0.15,0.3,0.45,0.55,0.15,0.3,0.45,0.55,0.6),efficacy=c(0.2,0.3,0.4,0.5,0.6,0.4,0.5,0.6,0.7,0.8,0.5,0.6,0.7,0.8,0.9),d=c(1:15))
orders<-matrix(nrow=3,ncol=15)
orders[1,]<-c(1,2,6,3,7,11,4,8,12,5,9,13,10,14,15)
orders[2,]<-c(1,6,2,3,7,11,12,8,4,5,9,13,14,10,15)
orders[3,]<-c(1,6,2,11,7,3,12,8,4,13,9,5,14,10,15)
# Index the rows of true_set with each order
First_order <- true_set[orders[1,],]
Second_order <- true_set[orders[2,],]
Third_order <- true_set[orders[3,],]
# If you want to store all orders in a list, you can try the command below:
First_orders <- list(First_Order=true_set[orders[1,],],Second_Order=true_set[orders[2,],],Third_Order=true_set[orders[3,],])
First_orders[[1]] # OR First_orders$First_Order
First_orders[[2]] # OR First_orders$Second_Order
First_orders[[3]] # OR First_orders$Third_Order
# If you want to combine the orders column wise, try the command below:
First_orders <- cbind(First_Order=true_set[orders[1,],],Second_Order=true_set[orders[2,],],Third_Order=true_set[orders[3,],])
# If you want to combine the orders row wise, try the command below:
First_orders <- rbind(First_Order=true_set[orders[1,],],Second_Order=true_set[orders[2,],],Third_Order=true_set[orders[3,],])
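If the number of orderings can grow, a compact sketch that builds the same list by looping over the rows of orders (all_orders is just an illustrative name):
all_orders <- lapply(seq_len(nrow(orders)), function(i) true_set[orders[i, ], ])
names(all_orders) <- c("First_Order", "Second_Order", "Third_Order")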

For each `pop` get frequencies of the elements of `id`

Consider this data:
m = data.frame(pop=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
id=c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))
> m
pop id
1 1 0
2 1 1
3 1 1
4 1 1
5 2 1
6 2 1
7 2 0
8 2 2
9 2 1
10 3 1
11 3 1
12 3 2
13 3 1
14 3 2
15 4 2
16 4 2
How can I get the frequency of each unique id within each unique pop? For example, id 1 appears 3 times in the 4 rows where pop == 1, so the frequency of id 1 in pop 1 is 0.75.
I came up with this ugly solution:
out = matrix(0, ncol = 3)
for (p in unique(m$pop)) {
  m1 = m[m$pop == p, ]                     # rows of this pop
  for (i in unique(m$id)) {
    f = nrow(m1[m1$id == i, ]) / nrow(m1)  # share of this id within the pop
    out = rbind(out, c(p, f, i))
  }
}
out = out[-1, ]
colnames(out) = c("pop", "freq", "id")
# SOLUTION
> out
pop freq id
[1,] 1 0.25 0
[2,] 1 0.75 1
[3,] 1 0.00 2
[4,] 2 0.20 0
[5,] 2 0.60 1
[6,] 2 0.20 2
[7,] 3 0.00 0
[8,] 3 0.60 1
[9,] 3 0.40 2
[10,] 4 0.00 0
[11,] 4 0.00 1
[12,] 4 1.00 2
I am sure a more efficient solution exists using data.table or table, but I couldn't find it.
Here's what I might do:
as.data.frame(prop.table(table(m),1))
# pop id Freq
# 1 1 0 0.25
# 2 2 0 0.20
# 3 3 0 0.00
# 4 4 0 0.00
# 5 1 1 0.75
# 6 2 1 0.60
# 7 3 1 0.60
# 8 4 1 0.00
# 9 1 2 0.00
# 10 2 2 0.20
# 11 3 2 0.40
# 12 4 2 1.00
If you want it sorted by pop, you can do that afterwards. Alternatively, you could transpose the table with t before converting to data.frame, or use rev(m) and prop.table on dimension 2.
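For instance, sorting the result by pop afterwards:
res <- as.data.frame(prop.table(table(m), 1))
res[order(res$pop, res$id), ]  # sorted by pop, then id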
Try:
library(dplyr)
m %>%
  group_by(pop, id) %>%
  summarise(s = n()) %>%
  mutate(freq = s / sum(s)) %>%
  select(-s)
Which gives:
#Source: local data frame [8 x 3]
#Groups: pop
#
# pop id freq
#1 1 0 0.25
#2 1 1 0.75
#3 2 0 0.20
#4 2 1 0.60
#5 2 2 0.20
#6 3 1 0.60
#7 3 2 0.40
#8 4 2 1.00
A data.table solution:
setDT(m)[, {div = .N; .SD[, .N/div, keyby = id]}, by = pop]
# pop id V1
#1: 1 0 0.25
#2: 1 1 0.75
#3: 2 0 0.20
#4: 2 1 0.60
#5: 2 2 0.20
#6: 3 1 0.60
#7: 3 2 0.40
#8: 4 2 1.00
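An equivalent two-step data.table sketch that may read more easily: count each pop/id pair, then divide by the per-pop total:
library(data.table)
setDT(m)[, .N, by = .(pop, id)][, freq := N / sum(N), by = pop][]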
