I have a data frame with different variables, and I have merged three columns of probabilities (a, b, c) onto it. My question is: how can I use these columns as probabilities in the sample function, so that the prob argument takes each column as the probabilities? For example, for y = 1 take column a, for y = 2 take column b, and so on. My code is:
a b c y
1 0.090 0.12 0.10 1
2 0.015 0.13 0.09 1
3 0.034 0.20 0.34 1
4 0.440 0.44 0.70 1
5 0.090 0.12 0.10 2
6 0.015 0.13 0.09 2
mydata$mig <- sample(1:3, size = 7, replace = TRUE, prob = ????)
Any help would be appreciated.
Using the apply function per row:
df <- read.table(header = TRUE, text="a b c y
1 0.090 0.12 0.10 1
2 0.015 0.13 0.09 1
3 0.034 0.20 0.34 1
4 0.440 0.44 0.70 1
5 0.090 0.12 0.10 2
6 0.015 0.13 0.09 2")
set.seed(12344)
samples1 <- apply(X = df[,-4], MARGIN = 1, # MARGIN = 1 applies FUN per row
FUN = function(x) sample( 1:3,
size = 7,
replace= TRUE ,
prob = x))
# You obtain six columns in samples1, one per row of df, each drawn with that row's probabilities
samples1
1 2 3 4 5 6
[1,] 2 3 3 1 3 2
[2,] 1 2 3 3 2 2
[3,] 2 3 3 3 1 3
[4,] 2 3 3 1 3 2
[5,] 2 2 3 2 3 2
[6,] 1 3 2 3 2 3
[7,] 3 3 3 2 1 2
Update:
Given your comment on my answer, I am updating and proposing a new solution using data.table. I leave the previous version above for reference, in case anyone is interested.
library(data.table)
setDT(df)
set.seed(78787)
#Column V1 has your 7 samples per group y, with probs taken at random from a,b,c
df[, sample(1:.N,
size = 7,
replace = TRUE,
prob = unlist(.SD)),
by = y,
.SDcols = sample(names(df)[-ncol(df)], 1)]
y V1
1: 1 4
2: 1 3
3: 1 4
4: 1 3
5: 1 4
6: 1 4
7: 1 4
8: 2 2
9: 2 1
10: 2 1
11: 2 1
12: 2 2
13: 2 1
14: 2 1
While the "normal" indexing of 2d matrices and frames is the [i,j] method, one can also provide a 2-column matrix to i alone to programmatically combine the rows and columns. We can use this to create a matrix whose first column is merely counting the rows (1:6 here), and the second column is taken directly from your y column:
cbind(seq_len(nrow(mydata)), mydata$y)
# [,1] [,2]
# [1,] 1 1
# [2,] 2 1
# [3,] 3 1
# [4,] 4 1
# [5,] 5 2
# [6,] 6 2
mydata[cbind(seq_len(nrow(mydata)), mydata$y)]
# [1] 0.090 0.015 0.034 0.440 0.120 0.130
Note that, as written, your sample-ing code is not going to work: true needs to be TRUE, and the vector of derived probabilities (length 6) is not the same length as your 1:3.
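If the goal is one draw per row, a minimal sketch (assuming mydata holds the a, b, c, y columns shown at the top) is to sample once per row with that row's values as weights:

set.seed(1)
mydata$mig <- apply(mydata[, c("a", "b", "c")], 1,
                    function(p) sample(1:3, size = 1, prob = p))

sample normalizes prob internally, so the weights need not sum to 1.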
My data frame looks like this
value <- c(0,0.1,0.2,0.4,0,"0.05,",0.05,0.5,0.20,0.40,0.50,0.60)
time <- c(1,1,"1,",1,2,2,2,2,3,3,3,3)
ID <- c("1,","2,","3,",4,1,2,3,4,1,2,3,4)
test <- data.frame(value, time, ID)
test
value time ID
1 0 1 1,
2 0.1 1 2,
3 0.2 1, 3,
4 0.4 1 4
5 0 2 1
6 0.05, 2 2
7 0.05 2 3
8 0.5 2 4
9 0.2 3 1
10 0.4 3 2
11 0.5 3 3
12 0.6 3 4
I want to replace the "," in all columns with "" but I am getting this error:
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
I would like my data to look like this
value time ID
1 0.00 1 1
2 0.10 1 2
3 0.20 1 3
4 0.40 1 4
5 0.00 2 1
6 0.05 2 2
7 0.05 2 3
8 0.50 2 4
9 0.20 3 1
10 0.40 3 2
11 0.50 3 3
12 0.60 3 4
EDIT
test %>%
mutate_all(~gsub(",","",.))
The easiest in this case might be to use parse_number from the readr package, e.g.:
apply(test, 2, readr::parse_number)
or in dplyr lingo:
test %>% mutate_all(readr::parse_number)
A simple base R solution:
test <- sapply(test, function(x) as.numeric(sub(",", "", x)))
test
value time ID
[1,] 0.00 1 1
[2,] 0.10 1 2
[3,] 0.20 1 3
[4,] 0.40 1 4
[5,] 0.00 2 1
[6,] 0.05 2 2
[7,] 0.05 2 3
[8,] 0.50 2 4
[9,] 0.20 3 1
[10,] 0.40 3 2
[11,] 0.50 3 3
[12,] 0.60 3 4
test %>%
mutate_at(vars(value, time, ID), ~ gsub(".*?(-?[0-9]+\\.?[0-9]*).*", "\\1", .))
# value time ID
# 1 0 1 1
# 2 0.1 1 2
# 3 0.2 1 3
# 4 0.4 1 4
# 5 0 2 1
# 6 0.05 2 2
# 7 0.05 2 3
# 8 0.5 2 4
# 9 0.2 3 1
# 10 0.4 3 2
# 11 0.5 3 3
# 12 0.6 3 4
The further we go into "let's try to parse anything that could be a number", the crazier it can get, scientific notation included. For that, the already-suggested readr::parse_number is likely a better candidate if you can accept one more package dependency.
However, seeing data like this suggests that either the method of import has some mistakes in it, or the process that produces the data does. While this patch works around those kinds of mistakes, it is far better to fix whatever error is causing them.
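For example, a quick sketch of parse_number's behaviour on values like the ones above (it drops non-numeric characters around the first number and handles grouping commas):

readr::parse_number(c("0.05,", "1,", "$2,000"))
# [1]    0.05    1.00 2000.00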
Problem: how can I fill backwards all rows in a group that come before an occurrence of a certain value? I am not trying to fill in NA or missing values with zoo::na.locf. In the following, I would like to set A to 1.00 in all rows before the row where 1.00 occurs, within each ID group, ideally using dplyr.
Input:
data<- data.frame(ID=c(1,1,1,1,2,2,2,3,3,3,4,4,4,4,4),
time=c(1,2,3,4,1,2,3,1,2,3,1,2,3,4,5),
A=c(0.10,0.25,1,0,0.25,1,0.25,0,1,0.10,1,0.10,0.10,0.10,0.05))
ID time A
1 1 0.10
1 2 0.25
1 3 1.00
1 4 0.00
2 1 0.25
2 2 1.00
2 3 0.25
3 1 0.00
3 2 1.00
3 3 0.10
4 1 1.00
4 2 0.10
4 3 0.10
4 4 0.10
4 5 0.05
Desired output:
ID time A
1 1 1.00
1 2 1.00
1 3 1.00
1 4 0.00
2 1 1.00
2 2 1.00
2 3 0.25
3 1 1.00
3 2 1.00
3 3 0.10
4 1 1.00
4 2 0.10
4 3 0.10
4 4 0.10
4 5 0.05
After grouping by ID you can check the cumulative sum of 1's and, wherever it is still below 1 (a 1 has not yet appeared), replace the A-value with 1:
data %>%
group_by(ID) %>%
mutate(A = replace(A, cumsum(A == 1) < 1, 1))
# Source: local data frame [15 x 3]
# Groups: ID [4]
#
# ID time A
# <dbl> <dbl> <dbl>
# 1 1 1 1.00
# 2 1 2 1.00
# 3 1 3 1.00
# 4 1 4 0.00
# 5 2 1 1.00
# 6 2 2 1.00
# 7 2 3 0.25
# 8 3 1 1.00
# 9 3 2 1.00
# 10 3 3 0.10
# 11 4 1 1.00
# 12 4 2 0.10
# 13 4 3 0.10
# 14 4 4 0.10
# 15 4 5 0.05
Quite similarly, you could also use cummax:
data %>% group_by(ID) %>% mutate(A = replace(A, !cummax(A == 1), 1))
And here's a base R approach:
transform(data, A = ave(A, ID, FUN = function(x) replace(x, !cummax(x == 1), 1)))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), find the row where 'A' is 1 within each 'ID', take the sequence of rows up to it, and use that as i to assign (:=) the value 1 to 'A':
library(data.table)
setDT(data)[data[, .I[seq_len(which(A==1))], ID]$V1, A := 1][]
# ID time A
# 1: 1 1 1.00
# 2: 1 2 1.00
# 3: 1 3 1.00
# 4: 1 4 0.00
# 5: 2 1 1.00
# 6: 2 2 1.00
# 7: 2 3 0.25
# 8: 3 1 1.00
# 9: 3 2 1.00
#10: 3 3 0.10
#11: 4 1 1.00
#12: 4 2 0.10
#13: 4 3 0.10
#14: 4 4 0.10
#15: 4 5 0.05
Or we can use ave from base R
data$A[with(data, ave(A==1, ID, FUN = cumsum)<1)] <- 1
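To see what that ave line does, the logical index it builds marks every row before the first 1 within its ID group (a quick sketch on the original data):

with(data, ave(A == 1, ID, FUN = cumsum) < 1)
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE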
(Reproducible example given.) The function causfinder::causalitycombinations below:
causalitycombinations <- function (nvars, ncausers, ndependents)
{
    # all ways to choose the causing variables
    independents <- combn(nvars, ncausers)
    ncomb <- ncol(independents)
    # number of dependent-variable choices per set of causers
    swingnumber <- ncol(combn(nvars - ncausers, ndependents))
    numberofallcombinations <- ncomb * swingnumber
    # dependents: for each causer set, all choices of dependents from the remaining variables
    dependents <- matrix(, nrow = numberofallcombinations, ncol = ndependents)
    for (i in seq_len(ncomb)) {
        dependents[(swingnumber * (i - 1) + 1):(swingnumber * i), ] <-
            t(combn(setdiff(seq_len(nvars), independents[, i]), ndependents))
    }
    # repeat each causer set once per dependent choice
    swingedindependents <- matrix(, nrow = numberofallcombinations, ncol = ncausers)
    for (i in seq_len(ncomb)) {
        for (j in seq_len(swingnumber)) {
            swingedindependents[(i - 1) * swingnumber + j, ] <- independents[, i]
        }
    }
    independentsdependents <- cbind(swingedindependents, dependents)
    # the remaining (conditioning) variables for each row
    others <- matrix(, nrow = numberofallcombinations, ncol = nvars - ncausers - ndependents)
    for (i in seq_len(numberofallcombinations)) {
        others[i, ] <- setdiff(seq_len(nvars), independentsdependents[i, ])
    }
    causalitiestemplate <- cbind(independentsdependents, others)
    causalitiestemplate
}
lists all the multivariate causality combinations. For example, in a 4-variable system (variables assigned the numbers 1, 2, 3, 4, with this assignment kept throughout the analysis), conditioning on the other 2 variables of the system, they are:
causalitycombinations(4,1,1)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 1 3 2 4
[3,] 1 4 2 3
[4,] 2 1 3 4
[5,] 2 3 1 4
[6,] 2 4 1 3 # to check whether the 2nd var Granger-causes the 4th var conditioned on 1 and 3
[7,] 3 1 2 4
[8,] 3 2 1 4
[9,] 3 4 1 2
[10,] 4 1 2 3
[11,] 4 2 1 3
[12,] 4 3 1 2
Now,
data.frame(from = causalitycombinations(4,1,1)[,1], to= causalitycombinations(4,1,1)[,2],
pval = c(0.5,0.6,0.1, #I just typed random p-vals here
0.4,0.8,0.2,
0.1,0.5,0.9,
0.0,0.0,0.1)
)
produces:
from to pval
1 1 2 0.5
2 1 3 0.6
3 1 4 0.1
4 2 1 0.4
5 2 3 0.8
6 2 4 0.2
7 3 1 0.1
8 3 2 0.5
9 3 4 0.9
10 4 1 0.0
11 4 2 0.0
12 4 3 0.1
In the "from" and "to" columns above, I want to print the variables' names (say: "inf", "gdp", "exc", "stock") instead of their representative numbers (i.e., 1, 2, 3, 4). How can I achieve this?
Equivalently: how can I list the combinations with strings instead of numbers?
We can update the columns by using their values to index a vector of the matching names by position:
# update columns with matching name
df1$from <- c("inf", "gdp", "exc", "stock")[df1$from]
df1$to <- c("inf", "gdp", "exc", "stock")[df1$to]
# result
df1
# from to pval
# 1 inf gdp 0.5
# 2 inf exc 0.6
# 3 inf stock 0.1
# 4 gdp inf 0.4
# 5 gdp exc 0.8
# 6 gdp stock 0.2
# 7 exc inf 0.1
# 8 exc gdp 0.5
# 9 exc stock 0.9
# 10 stock inf 0.0
# 11 stock gdp 0.0
# 12 stock exc 0.1
# input data
df1 <- read.table(text=" from to pval
1 1 2 0.5
2 1 3 0.6
3 1 4 0.1
4 2 1 0.4
5 2 3 0.8
6 2 4 0.2
7 3 1 0.1
8 3 2 0.5
9 3 4 0.9
10 4 1 0.0
11 4 2 0.0
12 4 3 0.1", header = TRUE)
Consider this data:
m = data.frame(pop=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
id=c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))
> m
pop id
1 1 0
2 1 1
3 1 1
4 1 1
5 2 1
6 2 1
7 2 0
8 2 2
9 2 1
10 3 1
11 3 1
12 3 2
13 3 1
14 3 2
15 4 2
16 4 2
I would like to get the frequency of each unique id in each unique pop. For example, id 1 is present 3 times out of 4 when pop == 1, therefore the frequency of id 1 in pop 1 is 0.75.
I came up with this ugly solution:
out = matrix(0,ncol=3)
for (p in unique(m$pop))
{
for (i in unique(m$id))
{
m1 = m[m$pop == p,]
f = nrow(m1[m1$id == i,])/nrow(m1)
out = rbind(out, c(p, f, i))
}
}
out = out[-1,]
colnames(out) = c("pop", "freq", "id")
# SOLUTION
> out
pop freq id
[1,] 1 0.25 0
[2,] 1 0.75 1
[3,] 1 0.00 2
[4,] 2 0.20 0
[5,] 2 0.60 1
[6,] 2 0.20 2
[7,] 3 0.00 0
[8,] 3 0.60 1
[9,] 3 0.40 2
[10,] 4 0.00 0
[11,] 4 0.00 1
[12,] 4 1.00 2
I am sure there is a more efficient solution using data.table or table, but I couldn't find it.
Here's what I might do:
as.data.frame(prop.table(table(m),1))
# pop id Freq
# 1 1 0 0.25
# 2 2 0 0.20
# 3 3 0 0.00
# 4 4 0 0.00
# 5 1 1 0.75
# 6 2 1 0.60
# 7 3 1 0.60
# 8 4 1 0.00
# 9 1 2 0.00
# 10 2 2 0.20
# 11 3 2 0.40
# 12 4 2 1.00
If you want it sorted by pop, you can do that afterwards. Alternatively, you could transpose the table with t before converting to a data.frame, or use rev(m) and prop.table on dimension 2; sketches of both follow.
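The two alternatives just mentioned, which give the same frequencies with the margins swapped:

as.data.frame(t(prop.table(table(m), 1)))   # transpose the table first
as.data.frame(prop.table(table(rev(m)), 2)) # reversed columns, prop over margin 2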
Try:
library(dplyr)
m %>%
group_by(pop, id) %>%
summarise(s = n()) %>%
mutate(freq = s / sum(s)) %>%
select(-s)
Which gives:
#Source: local data frame [8 x 3]
#Groups: pop
#
# pop id freq
#1 1 0 0.25
#2 1 1 0.75
#3 2 0 0.20
#4 2 1 0.60
#5 2 2 0.20
#6 3 1 0.60
#7 3 2 0.40
#8 4 2 1.00
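Note that, unlike the loop version, this keeps only the pop/id pairs that actually occur (8 rows instead of 12). If the zero-frequency rows matter, one option is to complete the grid first, a sketch assuming tidyr is also available:

library(tidyr)
m %>%
  count(pop, id) %>%                        # n per existing pair
  complete(pop, id, fill = list(n = 0)) %>% # add missing pairs with n = 0
  group_by(pop) %>%
  mutate(freq = n / sum(n)) %>%
  select(-n)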
A data.table solution:
setDT(m)[, {div = .N; .SD[, .N/div, keyby = id]}, by = pop]
# pop id V1
#1: 1 0 0.25
#2: 1 1 0.75
#3: 2 0 0.20
#4: 2 1 0.60
#5: 2 2 0.20
#6: 3 1 0.60
#7: 3 2 0.40
#8: 4 2 1.00
With
df <- data.frame(x = rep(1:3, each = 3)
, y = rep(1:3, 3)
, z = round(rnorm(9), 2))
df
x y z
1 1 1 0.55
2 1 2 0.99
3 1 3 -2.32
4 2 1 -0.25
5 2 2 1.20
6 2 3 -0.38
7 3 1 1.07
8 3 2 -0.98
9 3 3 -1.09
Is there a way to sort z within each x so that:
df.sort
x y z
1 1 3 -2.32
2 1 1 0.55
3 1 2 0.99
4 2 3 -0.38
5 2 1 -0.25
6 2 2 1.20
7 3 3 -1.09
8 3 2 -0.98
9 3 1 1.07
Thanks!
If you want to sort by z within each value of x (which is what your example shows, not quite what your question seems to ask), you can use plyr and arrange:
library(plyr)
dfa <- arrange(df, x, z)
What you are doing here is ordering first by x, then by z.
You could also create a new data.frame on the fly; just be sure to order by x as well as z, otherwise the z values get shuffled across the x groups:
data.frame(x = df$x, df[order(df$x, df$z), c("y", "z")])
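For completeness, a dplyr equivalent (assuming dplyr is loaded) is a one-liner:

library(dplyr)
df.sort <- df %>% arrange(x, z)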