How do I apply weights to my data frame in R? - r

What I want is to apply weights to the observations in my data frame. I already have a column with the weights that I want to apply to my data.
This is how my data frame looks:
weight count
     3     67
     7    355
     8     25
     7      2
And basically what I want is to weight each value of my count column with the respective weight from my weight column. For example, the value 67 of my count column should be weighted by 3, the value 355 by 7, and so on.
I tried to use this code from the questionr package:
wtd.table(data1$count, weights = data1$weight)
But this code altered my data frame and ended up reducing my 1447 rows to just 172 entries. What I want is to keep my exact number of entries.
The output that I want would be something like this; I just want to add another column to my data frame with the weighted values.
count  count applying weights
   67  ####
  355  ###

I am still not sure how to apply weights to the count data in the way you want.
I just want to show that you can conveniently create a new column based on existing columns by using dplyr. For example:
mydf
#   weight count
# 1      3    67
# 2      7   355
# 3      8    25
# 4      7     2
library(dplyr)

mydf %>% mutate(weightedCount = weight * count,
                percentRank = percent_rank(weightedCount),
                cumDist = cume_dist(weightedCount))
#   weight count weightedCount percentRank cumDist
# 1      3    67           201   0.6666667    0.75
# 2      7   355          2485   1.0000000    1.00
# 3      8    25           200   0.3333333    0.50
# 4      7     2            14   0.0000000    0.25
Here, weightedCount is the product of weight and count, percentRank shows the percentile rank of each value of weightedCount, and cumDist shows the cumulative distribution of the values in weightedCount.
This is just an example; you can create other columns and apply other functions in a similar way.
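If all you need is the extra column while keeping every one of your 1447 rows, the same thing works in base R without dplyr; a minimal sketch, assuming your data frame is called data1 as in your wtd.table call:

data1$weightedCount <- data1$weight * data1$count  # one value per row, so all rows are kept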

Related

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording bilateral trade data between 161 countries. The data are in dyadic format, with 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid and pid take values from 1 to 161, and a country is assigned the same rid and pid. For any given pair (rid, pid) with rid != pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
   rid pid TradeValue
1    2   3        500
2    2   7       2328
3    2   8    2233465
4    2   9      81470
5    2  12     572893
6    2  17     488374
7    2  19    3314932
8    2  23      20323
9    2  25         10
10   2  29    9026220
The data were sourced from the UN Comtrade database; each rid is paired with multiple pids to get their bilateral trade data. However, as can be seen, not every pid has a numeric id value, because I only assigned a rid or pid to a country if a list of relevant economic indicators for that country was available. This is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when a country becomes a "reporter": in that situation, the country did not report any TradeValue with partners, and its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), create a row for each of them and set the entire row (in the 161 x 161 matrix) to 0;
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose in a 5 x 5 case that country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with each other (but not with country 1). The original dataframe looks like:
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
which I want to convert to a 5 x 5 adjacency matrix (in data.frame format); the desired output should look like this:
  V1  V2  V3  V4  V5
1  0   0   0   0   0
2  0   0 223  13   9
3  0 223   0  57  28
4  0  13  57   0  82
5  0   9  28  82   0
I then want to use the same method on example_data to create the 161 x 161 adjacency matrix. However, after a couple of rounds of trial and error with reshape and other methods, I still could not work out the conversion, not even the first step.
I would really appreciate it if anyone could enlighten me on this.
I cannot read the Dropbox file, but I have tried to work off of your 5-country example dataframe:
library(reshape2)  # for dcast()

country_num = 5

# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = ifelse(length(setdiff(1:country_num, example_data$pid)) == 0,
                  1, setdiff(1:country_num, example_data$pid))

# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)

# add dummy dataframe to original
example_data = rbind(example_data, add_data)

# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")

# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])

# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]

# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
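As a further sketch of my own (not part of the answer above, and with the same 0-versus-NA caveat): base R's xtabs can build the zero-filled matrix directly, without the dummy-row step, if rid and pid are coerced to factors carrying the full level set.

lv <- 1:161  # the complete country id list
tab <- xtabs(TradeValue ~ factor(rid, levels = lv) + factor(pid, levels = lv),
             data = example_data)
tab[tab == 0] <- t(tab)[tab == 0]  # mirror one-sided reports, using the symmetry assumption
mat <- as.data.frame.matrix(tab)   # 161 x 161 data.frame; absent pairs stay 0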

How to find the group of rows of a data frame where an error occurs

I have a two-column dataframe containing thousands of IDs, where each ID has hundreds of data rows; in other words, a data frame of about 6 million rows. I am grouping this data frame by ID (using either dplyr or data.table) and applying a "tso" (outlier detection) function to each group. The problem is that after hours of computation it returns an error related to the ARIMA specification of one of the IDs. The question is: how can I identify the ID (or the row number) where my function throws the error? (If I can detect it, I can remove that ID from the dataframe.)
I tried to manually run my function on subgroups of this dataframe, but I cannot reach the erroneous ID that way; with thousands of IDs it would take me weeks to find it.
library(tsoutliers)  # tso() comes from the tsoutliers package
library(data.table)

outlier.detection <- function(x, iter) {
  y <- as.ts(x)
  out2 <- tso(y, maxit.iloop = iter, tsmethod = "auto.arima",
              remove.method = "bottom-up", cval = 3)
  y[out2$outliers$ind] <- NA
  return(y)
}

df <- data.table(outlying1); setkey(df, id)
test <- df[, list(new.weight = outlier.detection(weight, iter = 1)), by = id]
The above function finds the anomalies and replaces them with NAs. Here is an example:
   ID weight
1   a     50
2   a     50
3   a     51
4   a   51.5
5   a     52
6   b     80
7   b     81
8   b   81.5
9   b     90
10  b     82
It will look like the following:
   ID weight
1   a     50
2   a     50
3   a     51
4   a   51.5
5   a     52
6   b     80
7   b     81
8   b   81.5
9   b     NA
10  b     82
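One way to find the offending ID (a sketch of my own, assuming the data.table setup above) is to wrap the per-group call in tryCatch, so a failing group reports its id and yields NAs instead of aborting the whole computation:

safe.detection <- function(x, id, iter) {
  tryCatch(outlier.detection(x, iter),
           error = function(e) {
             message("tso failed for id ", id, ": ", conditionMessage(e))  # prints the culprit
             rep(NA_real_, length(x))  # keep the group's row count intact
           })
}

test <- df[, list(new.weight = safe.detection(weight, .BY$id, iter = 1)), by = id]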

Assigning a logical value to values higher than a given threshold for each case across each year

I have a data frame resembling the extract below:
set.seed(1)
smpl_df <- data.frame(year = c(1500:2011), case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)
I'm interested in adding one more column to this data frame, which should take the value 1 if the values in the column var_one were higher than a given threshold for all of the consecutive years represented in the data set for that case. For example, in its present format the table looks like this:
head(smpl_df)
  year case var_one
1 1500    A      27
2 1501    B      38
3 1502    C      58
4 1503    D      91
5 1504    A      21
6 1505    B      90
I would like to add a column to the data table (the values shown for the new column are not correct; they are just an example):
  year case var_one var_one_higher_than_80_for_all_yrs_for_this_case
1 1500    A      27                                                 0
2 1501    B      38                                                 0
3 1502    C      58                                                 0
4 1503    D      91                                                 1
5 1504    A      21                                                 0
6 1505    B      90                                                 1
Edit
To add to the post, following useful points raised in the comments: the long table that I'm currently working with could be obtained from a wide table. In that wide-format example I added a column NewColumn that takes the value Yes if, for a given case, the value was higher than 2, and No if the value was lower than or equal to 2 for all the years. I want to achieve the same effect but on my long table (smpl_df).
Edit 2
Following the useful comments concerning the desired final output, my intention is to generate a column that corresponds to the last column in the example table above.
Maybe an ifelse structure would be helpful:
smpl_df$var_one_higher <- ifelse(<your condition>, 1, 0)
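A minimal sketch of what that condition could be, assuming (from the column name in the example) that the threshold is 80 and that a case gets a 1 only when every one of its var_one values exceeds it:

# assumption: 1 only if all var_one values within the case exceed 80
smpl_df$var_one_higher <- ave(smpl_df$var_one, smpl_df$case,
                              FUN = function(v) as.integer(all(v > 80)))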

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
My data (a sample; the original has 20 observations per imputation and 13 variables per group, and all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
   1   1     25     25     24     24
   1   2      4      0      2      6
   1   3     11      5      3      2
   1   4     12      3      3      3
   2   1     20     15     15     15
   2   2      4      1      2      3
   2   3      0      0      0      6
   2   4     20      0      0      0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but I wonder what to do with all the imputations; can I just include all values? The imputed data now has 500 observations. With an ANOVA, for instance, I'd need to average the results over the 5 imputations to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, maxima and minima?
Such as:
library(reshape2)
library(ggplot2)

data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x = variable, y = value)) + geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifiers .imp and .id after melting, so instead use:
data_melt <- melt(df, c(".imp", ".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE", data_melt$variable), "FTE", "OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x = variable, y = value)) + geom_boxplot() + facet_wrap(~type, scales = "free_x")
The result is a plot with two facets, FTE and OMZ, each containing one boxplot per variable.
EDIT: fixed the data mess-up
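On the imputation question itself: pooling all five imputations into one box is often fine for a quick look, but if you want to see whether the imputations differ you could map .imp to fill (my own variation, not part of the answer above):

ggplot(data_melt, aes(x = variable, y = value, fill = factor(.imp))) +
  geom_boxplot() +
  facet_wrap(~type, scales = "free_x")

This draws a separate box per imputation round within each variable.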

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes an identical value for an individual across different rows (situations).
Taking into consideration the panel nature of the data, I want to re-sample d (as in a bootstrap):
with replacement,
storing each re-sampled data set as a data frame.
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for whom pid = 1, and the next six rows, with pid = 2, are observations for the second person.
This should work for you:
z <- replicate(100,
               d[d$pid %in% sample(unique(d$pid), 2000, replace = TRUE), ],
               simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but it will deal with duplicated rows. replicate has its obvious use of performing an operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call("rbind", ...) and lapply deals with the duplicates that are not handled well by the code above: instead of generating dataframes with potentially different lengths, this code generates a dataframe for each sampled pid and then uses do.call("rbind", ...) to stick them back together within each iteration of replicate.
z <- replicate(4,
               do.call("rbind",
                       lapply(sample(unique(d$pid), 3, replace = TRUE),
                              function(x) d[d$pid == x, ])),
               simplify = FALSE)
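As a quick usage sketch of my own (the column x and the statistic are just illustrative): each element of z is a data frame, so a bootstrap distribution is one apply away.

# e.g. a 95% percentile interval for the bootstrap distribution of mean(x)
boot_means <- sapply(z, function(s) mean(s$x))
quantile(boot_means, c(0.025, 0.975))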
