Sampling from a dataframe and having to find proportions - r

I am quite a beginner at R and was hoping someone could guide me with this question.
I have a data frame that tells me whether 2,000 individuals voted or not. I have to sample 100 individuals and then find what proportion of them voted.
To do that, I decided to assign a number to each individual to differentiate them and do the sample. After that, however, I don't know how to add the variable to know whether they voted or not. Here is what I did:
set.seed(100)
vote$assignment <- 1:2000
sample <- sample(vote$assignment, 100, replace = FALSE)
sample100 <- as.data.frame(sample)
First lines of the dataframe:
vote assignment
1 1
1 2
0 3
1 4
1 5
1 6
0 7
1 8
1 9
1 10
Any ideas on how to add to that data frame the information on whether they voted or not?
Thank you!

I am assuming you have 1/0 values in the vote column, where 1 marks the people who voted and 0 the people who did not.
You can randomly select 100 individuals and take the mean of the vote column; multiplied by 100, this is the percentage of people who voted.
# Assign a unique id to each row
vote$assignment <- seq(nrow(vote))
# Select 100 random rows
selected_rows <- sample(nrow(vote), 100)
# Get the percentage of people who voted
percent_voted <- mean(vote$vote[selected_rows]) * 100
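The mean works here because averaging a 0/1 column is the same as counting the 1s and dividing by the number of rows. A quick check on the ten rows printed in the question (the vector name votes is only for illustration):

```r
# The ten vote values printed in the question
votes <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1)

prop_mean  <- mean(votes)                      # average of the 0/1 values
prop_count <- sum(votes == 1) / length(votes)  # share of 1s, counted by hand

prop_mean
prop_count
prop.table(table(votes))                       # shares of 0s and 1s side by side
```

Both numbers agree, so mean() on the sampled rows gives the proportion directly.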

You can use the dplyr solution as suggested by Phil, or you can consider this base R solution.
# the data
set.seed(123)
df <- data.frame(id = 1:2000)
df$vote <- sample(c(1, 0), 2000, replace = TRUE)
# the sampling
samp_id <- sample(df$id, 100, replace = FALSE)
df_vote <- df[samp_id, ]
id vote
225 225 0
1279 1279 0
1585 1585 1
946 946 1
1578 1578 0
1481 1481 0
651 651 1
1601 1601 0
354 354 1
203 203 0
prop <- mean(df_vote$vote)
prop
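If you would rather keep the approach from the question (sample the ids first, then attach the vote information), one possible sketch is to merge the sampled ids back onto the full data frame. The vote data below is made up with rbinom() purely so the example runs; only the column names vote and assignment come from the question.

```r
set.seed(100)

# Made-up stand-in for the real data: 2,000 individuals, 1 = voted, 0 = did not
vote <- data.frame(vote = rbinom(2000, 1, 0.7))
vote$assignment <- seq_len(nrow(vote))

# Sample 100 ids, as in the question
sample100 <- data.frame(assignment = sample(vote$assignment, 100))

# Attach the vote column by matching on the id
sample100 <- merge(sample100, vote, by = "assignment")

# Proportion of the sampled individuals who voted
prop_voted <- mean(sample100$vote)
prop_voted
```

merge() sorts the result by the key column, so the rows come back in id order rather than in the order they were drawn.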

Related

how to find which rows are related by mathematical difference of x in R

I have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by a difference (positive or negative) of 162 ± 0.5. Then I would like a new column (d$DIFF) that reports the IDs linked by a MASS difference of 162 ± 0.5, and 0 for those IDs where the condition is not met. In this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help
Here's a base R solution using outer:
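# TRUE where two masses differ by 162 +/- 0.5
m <- outer(d$MASS, d$MASS, function(x, y) abs(abs(x - y) - 162) < 0.5)
# For each row, paste the matching row numbers, or "0" if there are none
# (looping over rows avoids apply() simplifying its result to a matrix)
d$DIFF <- vapply(seq_len(nrow(m)), function(i) {
  hits <- which(m[i, ])
  if (length(hits) == 0) "0" else paste(hits, collapse = " & ")
}, character(1))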
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.
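If the full 20K x 20K matrix is too large for memory, one possible workaround is to compare one row at a time against the whole MASS column. This is only a sketch: it does the same number of comparisons, but never holds more than one row of results at once. It also reports d$ID rather than row numbers, which is an assumption about what is wanted when IDs are not simply 1, 2, 3, ...

```r
d <- data.frame(ID = 1:5, MASS = c(324, 162, 508, 675, 670))

# For each row, find the other rows whose mass differs by 162 +/- 0.5
d$DIFF <- vapply(seq_len(nrow(d)), function(i) {
  hits <- which(abs(abs(d$MASS - d$MASS[i]) - 162) < 0.5)
  if (length(hits) == 0) "0" else paste(d$ID[hits], collapse = " & ")
}, character(1))

d
```

On the example data this gives the same DIFF column as the outer() version.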

Calculation within a pipe between different rows of a data frame

I have a tibble with a column of different numbers. I wish to calculate for every one of them how many others before them are within a certain range.
For example, let's say that range is 200; in the tibble below, the result for the 5th number would be 2, that is, the cardinality of the set {816, 705}, whose numbers are above 872 - 1 - 200 = 671 but below 872.
I have thought of something along the lines of:
for every row theRow of the tibble, compute between(X, Y) on the vector theTibble$number_list;
sum the returned logical vector.
I have been told that using loops is less efficient.
Is there a clean way to do this within a pipe without using loops?
Not the way you asked for it, but you can use a bit of linear algebra. It should be simpler and more efficient than a loop.
number_list <- c(248,650,705,816,872,991,1156,1157,1180,1277)
m <- matrix(number_list, nrow = length(number_list), ncol = length(number_list))
d <- (t(m) - number_list)
cutoff <- 200
# I used setNames to name the result, but you do not need to
# We count inclusive of 0 in case of ties
setNames(colSums(d >= 0 & d < cutoff) - 1, number_list)
Which gives you the following named vector.
248 650 705 816 872 991 1156 1157 1180 1277
0 0 1 2 2 2 1 2 3 3
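Note that the matrix version counts every value inside the window regardless of its position, so it matches "before them" only because number_list happens to be sorted. A minimal sketch that looks only at earlier elements, using the bounds stated in the question (above x - 1 - range, below x), could look like this; count_before is a hypothetical helper name, not an existing function:

```r
number_list <- c(248, 650, 705, 816, 872, 991, 1156, 1157, 1180, 1277)

# For each element, count the earlier elements that fall inside the window
count_before <- function(x, range) {
  vapply(seq_along(x), function(i) {
    earlier <- x[seq_len(i - 1)]   # empty for the first element
    sum(earlier > x[i] - 1 - range & earlier < x[i])
  }, integer(1))
}

counts <- count_before(number_list, 200)
counts
```

Because it takes a vector and returns a vector of the same length, it could be used inside a pipe as mutate(count = count_before(number_list, 200)).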
Here is another way that is pipe-able using rollapply().
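library(zoo)
library(dplyr)  # provides %>% and mutate()
cutoff <- 200
# df is assumed to be the tibble holding the number_list column
df %>%
  mutate(count = rollapply(number_list,
                           width = seq_along(number_list),
                           function(x) sum((tail(x, 1) - head(x, -1)) <= cutoff),
                           align = "right"))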
Which gives you another column.
# A tibble: 10 x 2
number_list count
<int> <int>
1 248 0
2 650 0
3 705 1
4 816 2
5 872 2
6 991 2
7 1156 1
8 1157 2
9 1180 3
10 1277 3

How to resample data by clusters (block sampling) with replacement in R using Sampling package

This is my dummy data:
income <- sample(1000:10000, 1000, replace = TRUE)
individuals <- sample(1:50, 1000, replace = TRUE)
datatest <- data.frame(income, individuals)
I know I can sample by individual rows with this code:
sample <- datatest[sample(nrow(datatest), replace=TRUE),]
Now, I want to extract random samples with replacement and equal probabilities of the dataset but sampling complete blocks of observations with the same individual code.
Note that there are 50 individuals, but 1000 observations. Some observations belong to the same individual, so I want to sample by individuals (clusters, in this case), not observations. I don't mind if the extracted samples differ slightly in the number of observations. How can I do that?
I have tried:
library(sampling)
samplecluster <- cluster (datatest, clustername=c("individuals"), size=50,
method="srswr")
But the outcome is not the sampled data. Am I missing something?
Well, it seems I was indeed missing something. After the cluster command you need to apply the getdata command (both from the sampling package). This way I do get the sample as I wanted, plus some additional columns.
samplecluster <- cluster(datatest, clustername = c("individuals"), size = 50, method = "srswr")
Gives you:
head(samplecluster)
individuals ID_unit Replicates Prob
1 1 259 2 0.63583
2 1 178 2 0.63583
3 1 110 2 0.63583
4 1 153 2 0.63583
5 1 941 2 0.63583
6 1 667 2 0.63583
Then using getdata, I also get the original data on income sampled by whole clusters:
datasample <- getdata (datatest, samplecluster)
head(datasample)
income individuals ID_unit Replicates Prob
1 8567 1 259 2 0.63583
2 2701 1 178 2 0.63583
3 4998 1 110 2 0.63583
4 3556 1 153 2 0.63583
5 2893 1 941 2 0.63583
6 7581 1 667 2 0.63583
I am not sure if I am missing something. If you just want some of your individuals, you can create a smaller sample of them:
ind.sample <- sample(1:50, size = 10)
print(ind.sample)
# [1] 17 43 38 39 28 23 35 47 9 13
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
head(my.sample)
# income individuals
#21 9072 17
#97 5928 35
#122 9130 43
#252 4388 43
#285 8083 28
#287 1065 35
I guess a more generic approach would be to generate random indexes:
ind.unique <- unique(datatest$individuals)
ind.sample.index <- sample(seq_along(ind.unique), size = 10)
ind.sample <- ind.unique[ind.sample.index]
print(ind.sample[order(ind.sample)])
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
ind.counts <- aggregate(income ~ individuals, my.sample, FUN = length)
print(ind.counts)
I think it's important to note that the dataset still needs to be expanded to include all the replicates.
sw<-data.frame(datasample[rep(seq_len(dim(datasample)[1]), datasample$Replicates),, drop = FALSE], row.names=NULL)
Might be helpful to someone
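For comparison, the same idea (draw whole individuals with replacement, then keep every observation of each draw) can be sketched in base R without the sampling package. This is an alternative under the assumptions of the dummy data above, not what cluster() does internally; the column name draw is made up for illustration.

```r
set.seed(1)

# Dummy data as in the question: 1000 observations spread over 50 individuals
datatest <- data.frame(income = sample(1000:10000, 1000, replace = TRUE),
                       individuals = sample(1:50, 1000, replace = TRUE))

# Draw individuals (clusters) with replacement and equal probabilities
ids   <- unique(datatest$individuals)
drawn <- sample(ids, length(ids), replace = TRUE)

# Collect the complete block of rows for every draw; an individual drawn
# twice contributes its block twice
blocks <- lapply(seq_along(drawn), function(k) {
  block <- datatest[datatest$individuals == drawn[k], , drop = FALSE]
  block$draw <- k   # which draw the block came from
  block
})
boot <- do.call(rbind, blocks)

nrow(boot)
```

The number of rows varies from one resample to the next, since individuals have different numbers of observations.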

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
Now I require the output as
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
Here, the calculation has to go as follows: if the quantity sold in March (3) is less than 60% of the quantity sold in the two previous months, January (1) and February (2), the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I suggest the dplyr package, which offers convenient functions for grouping rows and mutating columns in your data frame.
library(dplyr)

resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  # quantities sold one and two rows earlier within each KId
  mutate(monthMinus1Qty = lag(quantity_sold, 1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  # 1 when the current month is below 60% of the two previous months
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
The result is as below:
Adding
select(KId,sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate ones).
I believe this should satisfy your requirement. NA values in the result column are due to 0/0 division or to no data at all for the previous months.
Should you need to extend the calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments accordingly.
For more information on the dplyr package, follow this link
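The same reading of the requirement (current month compared with the sum of the two previous rows of the same KId) can also be sketched in base R. Comparing against 0.6 times the previous sum, instead of dividing, avoids the 0/0 case, so rows with no earlier sales get 0 rather than NA. The helper lag0 and the column prev2 are made-up names, and rows are lagged by position, which assumes each KId has consecutive months.

```r
sales <- data.frame(
  KId           = c(100, 100, 100, 496, 511, 846, 846, 846, 338, 338),
  sales_month   = c(1, 2, 3, 2, 2, 1, 2, 3, 1, 2),
  quantity_sold = c(0, 0, 0, 6, 10, 4, 6, 1, 6, 0)
)

# Shift a vector down by k positions, padding with 0 (no earlier sales)
lag0 <- function(x, k) c(rep(0, min(k, length(x))), head(x, -k))

# Order by KId and month so that "previous row" means "previous month"
sales <- sales[order(sales$KId, sales$sales_month), ]

# Sum of the two previous rows within each KId
sales$prev2 <- ave(sales$quantity_sold, sales$KId,
                   FUN = function(q) lag0(q, 1) + lag0(q, 2))

# 1 when the current quantity is below 60% of the two previous months
sales$result <- as.integer(sales$quantity_sold < 0.6 * sales$prev2)

sales
```

If the intended rule is different (for example, applying only to month 3), the last line is the one to adjust.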

Include zero frequencies in frequency table for Likert data

I have a dataset with responses to a Likert item on a 9pt scale. I would like to create a frequency table (and barplot) of the data but some values on the scale never occur in my dataset, so table() removes that value from the frequency table. I would like it instead to present the value with a frequency of 0. That is, given the following dataset
# Assume a 5pt Likert scale for ease of example
data <- c(1, 1, 2, 1, 4, 4, 5)
I would like to get the following frequency table without having to manually insert a column named 3 with the value 0.
1 2 3 4 5
3 1 0 2 1
I'm new to R, so maybe I've overlooked something basic, but I haven't come across a function or option that gives the desired result.
EDIT:
tabulate produces frequency tables while table produces contingency tables. However, to get zero frequencies in a one-dimensional contingency table as in the above example, the code below still works, of course.
This question provided the missing link. By converting the Likert item to a factor, and explicitly specifying the levels, levels with a frequency of 0 are still counted
data <- factor(data, levels = c(1:5))
table(data)
produces the desired output
table produces a contingency table, while tabulate produces a frequency table that includes zero counts.
tabulate(data)
# [1] 3 1 0 2 1
Another way (if you have integers starting from 1 - but easily modifiable for other cases):
setNames(tabulate(data), 1:max(data)) # to make the output easier to read
# 1 2 3 4 5
# 3 1 0 2 1
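One caveat: tabulate() stops at the largest value that actually occurs, so unused categories at the top of the scale would still be dropped. A small sketch, assuming the full scale runs from 1 to 9 as in the original question (the vector name likert is only for illustration), using the nbins argument and, for comparison, the factor approach:

```r
# Responses on a 9-point scale where 3 and 6 to 9 never occur
likert <- c(1, 1, 2, 1, 4, 4, 5)

# nbins forces one bin per scale point, including unused ones at the top
counts_tab <- tabulate(likert, nbins = 9)
setNames(counts_tab, 1:9)

# Same counts via a factor with explicit levels
counts_fac <- table(factor(likert, levels = 1:9))
counts_fac

# barplot(counts_fac) would then draw all nine bars, including the empty ones
```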
If you want to quickly calculate the counts or proportions for multiple Likert items and get your output in a data.frame, you may like the function response.frequencies in the psych package.
Let's create some data (note that there are no 8s or 9s):
df <- data.frame(item1 = sample(1:7, 2000, replace = TRUE),
item2 = sample(1:7, 2000, replace = TRUE),
item3 = sample(1:7, 2000, replace = TRUE))
If you want to calculate the proportion in each category
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9)
you get the following:
1 2 3 4 5 6 7 8 9 miss
item1 0.1450 0.1435 0.139 0.1325 0.1380 0.1605 0.1415 0 0 0
item2 0.1535 0.1315 0.126 0.1505 0.1535 0.1400 0.1450 0 0 0
item3 0.1320 0.1505 0.132 0.1465 0.1425 0.1535 0.1430 0 0 0
If you want counts, you can multiply by the sample size:
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9) * nrow(df)
You get the following:
1 2 3 4 5 6 7 8 9 miss
item1 290 287 278 265 276 321 283 0 0 0
item2 307 263 252 301 307 280 290 0 0 0
item3 264 301 264 293 285 307 286 0 0 0
A few notes:
The default max is 10. Thus, if you have more than 10 response options, you'll have issues. Otherwise, in your case, and in many Likert item cases, you could omit the max argument.
uniqueitems specifies the possible values. If all your values were present in at least one item, then this would be inferred from the data.
I think the function only works with numeric data. So if you have your Likert categories coded as "Strongly disagree", etc., it won't work.
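If you would rather avoid the extra dependency, a similar table of counts can be sketched in base R by applying table() with fixed factor levels to each column. The data frame below is regenerated with a seed so the example runs on its own; the names items, freq and props are illustrative.

```r
set.seed(42)

# Three items answered on a 9-point scale where 8 and 9 never occur
items <- data.frame(item1 = sample(1:7, 2000, replace = TRUE),
                    item2 = sample(1:7, 2000, replace = TRUE),
                    item3 = sample(1:7, 2000, replace = TRUE))

# One row per item, one column per scale point (zero counts included)
freq <- t(sapply(items, function(x) table(factor(x, levels = 1:9))))
freq

# Proportions instead of counts
props <- freq / nrow(items)
```

Since the levels are given explicitly, this also works for character responses if the levels vector lists the labels in order.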

Resources