Expected number from data in a data.frame in R

I want to turn this equation into R code: (e^(-mean) * mean^i / i!) * N, where i is the index and N is the sample size.
What I have is this:
x["expected92"]<-((exp(-me92))(me92^(x$multX1992))/(x$multX1992));
I want to create a new column that goes through the index and computes the expected number for each row.
example data:
Drag 1992 multX1992
0 113 0
1 30 30
3 15 30
example of wanted output:
Drag 1992 multX1992 expected92
0 113 0 90.03
1 30 30 58.80
3 15 30 19.20
Can someone help fix my code?
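One possible fix, offered as a sketch rather than a definitive answer: the posted line has no `*` between the two bracketed factors (so R tries to call the first one as a function), never applies the factorial, and never multiplies by N. The version below assumes that Drag is the index i, that the 1992 column holds the observed counts (renamed count1992 here, a hypothetical name), and that me92 is the count-weighted mean of Drag. None of that is confirmed by the post, and the sketch makes no attempt to reproduce the figures in the wanted output, which cannot be checked from the three rows shown.

```r
# Hypothetical reconstruction of the three example rows
x <- data.frame(Drag = c(0, 1, 3),
                count1992 = c(113, 30, 15))

N    <- sum(x$count1992)                   # assumed: sample size = total count
me92 <- sum(x$Drag * x$count1992) / N      # assumed: count-weighted mean of Drag

# (e^(-mean) * mean^i / i!) * N, written out term by term
x$expected92 <- exp(-me92) * me92^x$Drag / factorial(x$Drag) * N

# The same quantity via the built-in Poisson density
expected_dpois <- N * dpois(x$Drag, lambda = me92)
```

Because the expression is vectorised over x$Drag, no explicit loop over the index is needed.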

Related

Mapping dataframe column values to a n by n matrix

I'm trying to map column values of a data.frame object (consisting of large number of bilateral trade data among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flows between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a given country has the same id in both columns. Each id in the rid column is matched with multiple rows in the pid column, one TradeValue per partner.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.
What I'm attempting to do is map this example_data dataframe to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between any two country ids. To this end, there are a couple of things I need to tackle:
Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
Following the previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head, I could not figure out how to do it. I would really appreciate it if someone could help me with this.
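# df1 is the 5-country example data frame from the question
# Declare all five ids as factor levels so that unobserved ids still get a row/column
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
# Reshape from long to wide, filling cells that have no data with 0
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")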
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The tricks:
Use factor variables to tell R which values are possible, as well as their order.
In data.table's dcast, use fill = 0 (fill with zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
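The dcast output above covers the first step (adding the missing rid rows) but leaves the rows for rid_1 and rid_4 at zero. A minimal base-R sketch of the second step, filling each zero cell from its mirror cell, could look like the following. It uses the 5-country example from the question and assumes that a value of 0 always means "not reported", which may not hold for real trade data.

```r
# The 5-country example from the question
df1 <- data.frame(
  rid = c(2, 2, 2, 2, 3, 3, 3, 3, 5, 5, 5, 5),
  pid = c(1, 3, 4, 5, 1, 2, 4, 5, 1, 2, 3, 4),
  TradeValue = c(50, 45, 7, 18, 24, 45, 88, 12, 27, 18, 12, 92)
)

ids <- 1:5   # all country ids, whether or not they appear in rid

# Step 1: an all-zero matrix with one row and one column per country
m <- matrix(0, nrow = length(ids), ncol = length(ids),
            dimnames = list(paste0("rid_", ids), paste0("pid_", ids)))

# Place each reported value at its (rid, pid) position
m[cbind(match(df1$rid, ids), match(df1$pid, ids))] <- df1$TradeValue

# Step 2: where a cell is still 0, take the value reported by the partner
unreported <- m == 0
m[unreported] <- t(m)[unreported]
m
```

For the full data set, ids would be the sorted union of the rid and pid columns instead of 1:5.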

Bootstrapping multiple columns with R

I'm relatively new at R and I'm trying to build a function which will loop through columns in an imported table and produce an output which consists of the means and 95% confidence intervals. Ideally it should be possible to bootstrap columns with different sample sizes, but first I would like to get the iteration working. I have something that sort-of works, but I can't get it all the way there. This is what the code looks like, with the sample data and output included:
#cdata<-read.csv(file.choose(),header=T)#read data from selected file, works, commented out because data is provided below
#cdata #check imported data
#Sample Data
# WALL NRPK CISC WHSC LKWH YLPR
#1 21 8 1 2 2 5
#2 57 9 3 1 0 1
#3 45 6 9 1 2 0
#4 17 10 2 0 3 0
#5 33 2 4 0 0 0
#6 41 4 13 1 0 0
#7 21 4 7 1 0 0
#8 32 7 1 7 6 0
#9 9 7 0 5 1 0
#10 9 4 1 0 0 0
x<-cdata[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant species
i<-nrow(x) #count number of rows for bootstrapping
g<-ncol(x) #count number of columns for iteration
#build bootstrapping function, this works for the first column but doesn't iterate
bootfun <- function(bootdata, reps) {
  boot <- function(bootdata){
    s1=sample(bootdata, size=i, replace=TRUE)
    ms1=mean(s1)
    return(ms1)
  } # a single bootstrap
  bootrep <- replicate(n=reps, boot(bootdata))
  return(bootrep)
} #replicates bootstrap of "bootdata" "reps" number of times and outputs vector of results
cvr1 <- bootfun(x$YLPR,50000) #have unsuccessfully tried iterating the location various ways (i.e. x[i])
cvrquantile<-quantile(cvr1,c(0.025,0.975))
cvrmean<-mean(cvr1)
vec<-c(cvrmean,cvrquantile) #puts results into a suitable form for output
vecr<-sapply(vec,round,1) #rounds results
vecr
2.5% 97.5%
28.5 19.4 38.1
#apply(x[1:g],2,bootfun) ##doesn't work in this case
#desired output:
#Species Mean LowerCI UpperCI
#WALL 28.5 19.4 38.1
#NRPK 6.1 4.6 7.6
#YLPR 0.6 0.0 1.6
I've also tried this using the boot package, and it works beautifully to iterate through the means but I can't get it to do the same with the confidence intervals. The "ordinary" code above also has the advantage that you can easily retrieve the bootstrapping results, which might be used for other calculations. For the sake of completeness here is the boot code:
#Bootstrapping using boot package
library(boot)
#data<-read.csv(file.choose(),header=TRUE) #read data from selected file
#x<-data[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant columns
#x #check data
#Sample Data
# WALL NRPK LKWH YLPR
#1 21 8 2 5
#2 57 9 0 1
#3 45 6 2 0
#4 17 10 3 0
#5 33 2 0 0
#6 41 4 0 0
#7 21 4 0 0
#8 32 7 6 0
#9 9 7 1 0
#10 9 4 0 0
i<-nrow(x) #count number of rows for resampling
g<-ncol(x) #count number of columns to step through with bootstrapping
boot.mean<-function(x,i){boot.mean<-mean(x[i])} #bootstrapping function to get the mean
z<-boot(x, boot.mean,R=50000) #bootstrapping function, uses mean and number of reps
boot.ci(z,type="perc") #derive 95% confidence intervals
apply(x[1:g],2, boot.mean) #bootstrap all columns
#output:
#WALL NRPK LKWH YLPR
#28.5 6.1 1.4 0.6
I've gone through all of the resources I can find and can't seem to get things working. What I would like for output would be the bootstrapped means with the associated confidence intervals for each column. Thanks!
Note: the line apply(x[1:g],2, boot.mean) does not do any bootstrapping. It simply calculates the mean of each column.
For the bootstrap mean and confidence interval of each column, try this:
apply(x,2,function(y){
  b<-boot(y,boot.mean,R=50000);
  c(mean(b$t),boot.ci(b,type="perc", conf=0.95)$percent[4:5])
})
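For comparison, the same per-column summary can be sketched without the boot package, along the lines of the "ordinary" code in the question. The function name boot_summary and the reduced number of replicates are choices made for this illustration, not part of the original post. Dropping NA values inside the function is one assumed way to allow columns with different sample sizes, provided shorter columns are padded with NA.

```r
# Sample data from the question (relevant species only)
cdata <- data.frame(
  WALL = c(21, 57, 45, 17, 33, 41, 21, 32, 9, 9),
  NRPK = c(8, 9, 6, 10, 2, 4, 4, 7, 7, 4),
  LKWH = c(2, 0, 2, 3, 0, 0, 0, 6, 1, 0),
  YLPR = c(5, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Bootstrap one column: mean of the bootstrap means plus a percentile interval
boot_summary <- function(v, reps = 2000) {
  v <- v[!is.na(v)]                      # each column uses its own sample size
  means <- replicate(reps, mean(sample(v, size = length(v), replace = TRUE)))
  ci <- quantile(means, c(0.025, 0.975), names = FALSE)
  c(Mean = mean(means), LowerCI = ci[1], UpperCI = ci[2])
}

set.seed(1)                              # only to make the illustration repeatable
res <- round(t(sapply(cdata, boot_summary)), 1)
res                                      # one row per species
```

If the individual bootstrap means are needed for other calculations, the replicate() call can be returned as well instead of only its summary.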

Print variable names in table() with 2 binary variables in R

I'm sure I'll kick myself for not being able to figure this out, but when you have a table with 2 variables (i.e. cross-tab) and both are binary or otherwise have the same levels, how can you make R show which variable is displayed row-wise and which is column-wise?
For example:
> table(tc$tr, tc$fall_term)
0 1
0 1569 538
1 0 408
is a little confusing because it's not immediately obvious which is which. Of course, I checked out ?table but I don't see an option to do this, at least not a logical switch that doesn't require me to already know which is which.
I tried ftable but had the same problem.
The output I want would be something like this:
> table(tc$tr, tc$fall_term)
tr tr
0 1
fallterm 0 1569 538
fallterm 1 0 408
or
> table(tc$tr, tc$fall_term)
fallterm fallterm
0 1
tr 0 1569 538
tr 1 0 408
You can use the dnn option:
table(df$tr,df$fall_term) # impossible to tell the difference
0 1
0 18 33
1 15 34
table(df$tr,df$fall_term,dnn=c('tr','fall_term')) # you have the names
fall_term
tr 0 1
0 18 33
1 15 34
Note that if df contains only these two columns, in this order, it's easier (and safer) to do table(df$tr,df$fall_term,dnn=colnames(df))
Check out dimnames, and in particular their names. I’m using another example here since I don’t have your data:
x = HairEyeColor[, , Sex = 'Male']
names(dimnames(x))
# [1] "Hair" "Eye"
names(dimnames(x)) = c('Something', 'Else')
x
# Else
# Something Brown Blue Hazel Green
# Black 32 11 10 3
# Brown 53 50 25 15
# Red 10 10 7 7
# Blond 3 30 5 8
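The approaches above can be compared side by side in a small sketch. The data frame tc below is made up for the illustration, since the original data are not available; only the variable names tr and fall_term come from the question.

```r
# Hypothetical stand-in for the asker's data
tc <- data.frame(tr        = c(0, 0, 1, 1, 0),
                 fall_term = c(0, 1, 1, 1, 0))

# 1. Name the dimensions explicitly with dnn
t1 <- table(tc$tr, tc$fall_term, dnn = c("tr", "fall_term"))

# 2. Pass the vectors as named arguments
t2 <- table(tr = tc$tr, fall_term = tc$fall_term)

# 3. Pass a data frame of the two columns; the column names are used
t3 <- table(tc[, c("tr", "fall_term")])

# 4. Rename the dimensions of an existing table afterwards
t4 <- table(tc$tr, tc$fall_term)
names(dimnames(t4)) <- c("tr", "fall_term")

names(dimnames(t1))
```

In all four cases the first name labels the rows and the second labels the columns.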

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period that the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to get the modulo of each interval and sort them into their new intervals that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate the data nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
  arrange(Year) %>%
  mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
  mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",
                           (bottom[findInterval(year2, bottom) + 1] - 1)))
View(new_df)
This creates five-year bins and adds a new column (new_year) holding the label of the bin that each interval's end year falls into. So everything ending in 1750-1754 gets the new value 1750-1754 (in string form; the original is an integer type, not sure how to fix that). Does this do what you want? Double check the results, but it looks right to me.
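The binning step can also be sketched in base R on a handful of the interval labels from the question. The vector year stands in for mlt.mx.info$Year, which is assumed here to hold "start-end" strings. The final counts are what a barplot would display.

```r
# A few of the interval labels from the question
year <- c("1990-1994", "1991-1994", "1992-1994", "2000-2003", "2005-2007")

bottom   <- seq(1750, 2010, by = 5)          # lower edge of every five-year bin
end_year <- as.numeric(substr(year, 6, 9))   # last year of each interval
idx      <- findInterval(end_year, bottom)   # index of the bin holding that year

# Label each interval with its five-year bin, e.g. "1990-1994"
new_year <- paste0(bottom[idx], "-", bottom[idx] + 4)

counts <- table(new_year)                    # number of life tables per bin
counts
# barplot(counts, las = 2)                   # uncomment to draw the bar chart
```

Binning on the end year works here because, as noted in the question, the shorter intervals sit entirely inside a standard five-year interval.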

subset all columns in a data frame less than a certain value in R

I have a dataframe that contains 7 p-value variables.
I can't post it because it is private data but it looks like this:
>df
o m l c a aa ep
1.11E-09 4.43E-05 0.000001602 4.02E-88 1.10E-43 7.31E-05 0.00022168
8.57E-07 0.0005479 0.0001402 2.84E-44 4.97E-17 0.0008272 0.000443361
0.00001112 0.0005479 0.0007368 1.40E-39 3.17E-16 0.0008272 0.000665041
7.31E-05 0.0006228 0.0007368 4.59E-33 2.57E-13 0.0008272 0.000886721
8.17E-05 0.002307 0.0008453 4.58E-18 5.14E-12 0.0008336 0.001108402
Each column has values from 0-1.
I would like to subset the entire data frame by extracting all the values in each column less than 0.009 and making a new data frame. If I were to extract on this condition, the columns would have very different lengths. E.g. c has 290 values less than 0.009, and o has 300, aa has 500 etc.
I've tried:
subset(df,c<0.009 & a<0.009 & l<0.009 & m<0.009& aa<0.009 & o<0.009)
When I do this, I just end up with a very small number of rows, with every column the same length, which isn't what I want. I want all the values in each column that meet the subset criterion.
I then want to take this data frame and bin it into p-value range groups by using something like the summary(cut()) function, but I am not sure how to do it.
So essentially I would like to have a final data frame that includes the number of values in each p-value bin for each variable:
o# m# l# c# a# aa# ep#
0.00-0.000001 545 58 85 78 85 45 785
0.00001-000.1 54 77 57 57 74 56 58
0.001-0.002 54 7 5 5 98 7 5 865
An attempt:
sapply(df,function(x) table(cut(x[x<0.009],c(0,0.000001,0.001,0.002,Inf))) )
# o m l c a aa ep
#(0,1e-06] 2 0 0 5 5 0 0
#(1e-06,0.001] 3 4 5 0 0 5 4
#(0.001,0.002] 0 0 0 0 0 0 1
#(0.002,Inf] 0 1 0 0 0 0 0
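A slightly more explicit sketch of the same idea follows, using two of the columns shown in the question. Since the filtered columns have different lengths, they are kept in a list rather than a data frame, and the last bin is capped at the 0.009 threshold instead of Inf. The break points are the ones used in the attempt above and are only an example.

```r
# Two of the columns shown in the question
df <- data.frame(
  o = c(1.11e-09, 8.57e-07, 1.112e-05, 7.31e-05, 8.17e-05),
  m = c(4.43e-05, 5.479e-04, 5.479e-04, 6.228e-04, 2.307e-03)
)

threshold <- 0.009
breaks    <- c(0, 1e-06, 0.001, 0.002, threshold)

# Filter each column separately; a list allows columns of different lengths
kept <- lapply(df, function(p) p[p < threshold])

# Count how many p-values of each variable fall into each bin
counts <- sapply(kept, function(p) table(cut(p, breaks)))
counts
```

Note that cut() uses intervals that are open on the left by default, so a p-value of exactly 0 would not be counted unless include.lowest = TRUE is added.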
