Say I have three individuals and I know the payment each requires to enter different amounts of land into a scheme. For a given payment rate, I want to know how much land each participant would enter, and I want that to be the maximum amount they are willing to enter at that rate. Previously I did this with a long ifelse statement, but that will not run inside a loop, so I'm looking for an alternative.
In this example, I've excluded a load of areas so it just presents as if participants can enter 50, 49 or 1 unit(s) of area.
paym_sh1a=200
paym_area_50 <- c(250, 150, 210)
paym_area_49 <- c(240, 130, 190)
paym_area_1 <- c(100, 20, 90)
area_enrolled<-
ifelse(paym_area_50<paym_sh1a,50,ifelse(paym_area_49<paym_sh1a,49,
ifelse(paym_area_1<paym_sh1a,1,0)))
You could create a table of your values:
paym_area = rbind(paym_area_50, paym_area_49, paym_area_1)
And then use vectorised operations more effectively. In particular, since your thresholds are decreasing, you could compare the whole table to the sh1a value, and count how many rows are below it:
(sums = colSums(paym_area < paym_sh1a))
# [1] 1 3 2
This vector can be used as an index into your results:
values = c(0, 1, 49, 50)
(result = values[sums + 1L])
# [1] 1 50 49
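Since the question mentions running this inside a loop, here is a minimal sketch that wraps the same idea in a function of the payment rate. The names enrolled_area, areas and rates are made up for illustration, and the sketch assumes (as above) that each participant's required payment never decreases as the area grows.

```r
# Rows are ordered from the largest area to the smallest.
paym_area <- rbind(
  paym_area_50 = c(250, 150, 210),
  paym_area_49 = c(240, 130, 190),
  paym_area_1  = c(100, 20, 90)
)
areas <- c(50, 49, 1)  # area represented by each row, largest first

# Hypothetical helper: area each participant enrols at a given payment rate.
enrolled_area <- function(rate) {
  # Number of thresholds each participant clears at this rate.
  n_below <- colSums(paym_area < rate)
  # Clearing 0 thresholds means 0 area, 1 means the smallest area, and so on.
  c(0, rev(areas))[n_below + 1L]
}

enrolled_area(200)
# [1]  1 50 49

# One column per payment rate, one row per participant.
rates <- c(100, 200, 300)
sapply(rates, enrolled_area)
```

For a rate of 200 this agrees with the nested ifelse from the question.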
I am trying to carry out following operation in R
I have different series of data,
series 1: 75, 56, 100, 23, 38, 40 series 2: 60, 18, 86, 100, 44
I would like to annex these data. To do so, I have to multiply series 1 by 1.5 to make last data of series 1 (40) match with the first data of the second series (60) (40*1.5=60)
In the same way, I would like to match many different series, but other series will need different multipliers. For example, for Series 1: ...20 and Series 2: 80..., I would have to multiply by 4.
How can I carry out such an operation to many series in many data frames?
Thanks in advance,
Given two vectors x and y, the function f(x,y) below will convert x the way you desire.
f <- function(x,y) x*(y[1]/x[length(x)])
Usage:
x = c(75,56,100,23,38,40)
y = c(60,18,86,100,44)
f(x,y)
Output:
[1] 112.5 84.0 150.0 34.5 57.0 60.0
However, how this approach gets applied to "many series in many data frames" depends on the actual structure you have, and what type of output you want.
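As one possible reading of "many series", here is a rough sketch that assumes the series are held in a list in the order they should be joined. The helper names rescale_to and splice are invented for this example, and dropping the duplicated joining value is an assumption about the desired output.

```r
# Same scaling as f() above: make the last value of x equal the first of y.
rescale_to <- function(x, y) x * (y[1] / x[length(x)])

# Rescale x, then append y without repeating the shared joining value.
splice <- function(x, y) c(rescale_to(x, y), y[-1])

series <- list(
  c(75, 56, 100, 23, 38, 40),
  c(60, 18, 86, 100, 44)
)

# Fold from the right so each series is matched to the already joined tail.
joined <- Reduce(splice, series, right = TRUE)
joined
# [1] 112.5  84.0 150.0  34.5  57.0  60.0  18.0  86.0 100.0  44.0
```

With more than two series in the list, the same Reduce call should chain them, using the last series as the reference scale.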
I have struggled with this question for a long time, and I have looked extensively on the Internet but never found a solution. Imagine I have the following dataset:
df <- data.frame("Individuals" = c(1,2,3,4,5,6),
"Height" = c(150, 200, 200, 200, 150, 150),
"Weight" = c(100, 50, 50, 100, 50, 100))
This dataset has 6 individuals. For each individual, we measure two attributes: height (takes value 150 cm or 200 cm) and weight (takes value 50kg and 100kg). I want to create a categorical variable that classifies together individuals whose height and weight are equal. In this case, this variable would look like this:
output_df <- data.frame("Individuals" = c(1,2,3,4,5,6),
"Height" = c(150, 200, 200, 200, 150, 150),
"Weight" = c(100, 50, 50, 100, 50, 100),
"Groups of individuals" = c(1, 2, 2, 3, 4, 1))
There are four groups of individuals with equal values in both variables. In group 1, all have height = 150 and weight = 100, in group 2 all have height = 200 and weight = 50 , in group 3 all have a height = 200 and weight = 100 kg (there is only one individual in this group, but this would still be a separate "group of individuals" insofar it has a different combination of values of the other variables compared to the rest of the groups) and in group 4 all have a height of 150 cm and weight 50 kg (same as for group 3, only one individual in this group).
In this case, it is easy to make this classification manually and thus create the variable "Group of individuals".
Now imagine I have more variables beyond height and weight, and I want to create the variable "Group of individuals" without knowing in advance the possible values height and weight (and other variables, if they exist) take. So I want to create a new variable, whose value depends on which group of observations a given observation is. The group of observations are defined by equality conditions; i.e., an observation is classified as pertaining to a given group of observations whose values across several variables are exactly equal.
I am finding it extremely difficult to write down the condition that defines this new variable in a generalized manner. The number of values this variable takes is not known a priori (depends on the specific set of individuals you have). It has a theoretical minimum of 1 (all observations have equal values for all variables) and a theoretical maximum equal to the number of observations (all observations have different values for all variables, there are no groups of individuals with equal values for different variables). In my application, I want to create this variable for different datasets, therefore it will have a different number of values for each dataset.
My best attempts have involved the use of group_by() and case_when() within the tidyverse. I assume there has to be a way to express this as a if_else statement or some other type of conditional statement. Another intuition is that creating this variable might entail some kind of pivoting, creation of the variable, and then pivoting back again (also within the tidyverse: https://tidyr.tidyverse.org/articles/pivot.html ). I think the reason why the idea is challenging to me is that you create a variable that for each observations takes a given value as defined by equality conditions across observations, and not variables, which gets me very confused. This is why I guess it might be done with pivoting, because I think one might be able to translate this problem as creating a variable as a function of other variables first, and then come back to a dataset in which this variable is a function of equality across observations.
I really hope the formulation of the question is not too confusing. I find the issue so confusing myself, that it is also difficult to express it. I guess that if I could express it better, I might be able to solve it.
Thank you so much!
One way would be to create a unique key combining Height & Weight values and use match and unique to get group number.
key <- with(df, paste(Height, Weight, sep = '-'))
df$group <- match(key, unique(key))
df
# Individuals Height Weight group
#1 1 150 100 1
#2 2 200 50 2
#3 3 200 50 2
#4 4 200 100 3
#5 5 150 50 4
#6 6 150 100 1
If the order of the groups is not important and you only care that people with the same height and weight get the same group number, we can also use cur_group_id from dplyr.
library(dplyr)
df <- df %>% group_by(Height, Weight) %>% mutate(group = cur_group_id())
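The question also asks about having more variables than Height and Weight. A sketch of how the match/unique idea might generalise, assuming every column except the identifier defines the group (the name group_cols and the "|" separator are choices made for this example, and the separator is assumed not to occur in the data):

```r
df <- data.frame(
  Individuals = c(1, 2, 3, 4, 5, 6),
  Height = c(150, 200, 200, 200, 150, 150),
  Weight = c(100, 50, 50, 100, 50, 100)
)

# Every column except the identifier is treated as a grouping variable.
group_cols <- setdiff(names(df), "Individuals")

# Build one key per row by pasting the grouping columns together.
key <- do.call(paste, c(df[group_cols], sep = "|"))

# Number the groups in order of first appearance.
df$group <- match(key, unique(key))
df$group
# [1] 1 2 2 3 4 1
```

Adding further columns to df changes group_cols automatically, so the key does not have to be rewritten by hand.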
Apologies if this has been asked before, but I've searched for a while and can't find anything to answer my question. I'm somewhat comfortable using R but never really learned the fundamentals. Here's what I'm trying to do.
I've got a vector (call it "responseTimes") that looks something like this:
150 50 250 200 100 150 250
(It's actually much longer, but I'm truncating it here.)
I've also got a data frame where one column, timeBin, is essentially counting up by 50 from 0 (so 0 50 100 150 200 250 etc).
What I'm trying to do is to count how many values in responseTimes are less than or equal to each row in the data frame. I want to store these counts in a new column of my data frame. My output should look something like this:
timeBin counts
0 0
50 1
100 2
150 4
200 5
250 7
I know I can use the sum function to compare vector elements to some constant (e.g., sum(responseTimes>100) would give me 5 for the data I've shown here) but I don't know how to do this to compare to a changing value (that is, to compare to each row in the timeBin column).
I'd prefer not to use a loop, as I'm told those can be particularly slow in R and I have quite a large data set that I'm working with. Any suggestions would be much appreciated! Thanks in advance.
You can use sapply this way:
> timeBin <- seq(0, 250, by=50)
> responseTimes <- c(150, 50, 250, 200, 100, 150, 250 )
>
> # using sapply (after all `sapply` is a loop)
> ans <- sapply(timeBin, function(x) sum(responseTimes<=x))
> data.frame(timeBin, counts=ans) # your desired output.
timeBin counts
1 0 0
2 50 1
3 100 2
4 150 4
5 200 5
6 250 7
That might help:
responseTimes <- c(150, 50, 250, 200, 100, 150, 250)
bins1 <- seq(0, 250, by = 50)
sahil1 <- function(input = responseTimes, binsx = bins1) {
  tablem <- table(cut(input, binsx)) # count of input across bins
  tablem <- cumsum(tablem) # cumulative sums
  return(as.data.frame(tablem)) # table to data frame
}
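If avoiding an explicit loop matters, another option worth trying is findInterval from base R. For a sorted vector it returns, for each bin value, how many elements are less than or equal to it, which is the count asked for. A small sketch using the data from the question:

```r
responseTimes <- c(150, 50, 250, 200, 100, 150, 250)
timeBin <- seq(0, 250, by = 50)

# findInterval() needs its second argument sorted in non-decreasing order.
counts <- findInterval(timeBin, sort(responseTimes))

data.frame(timeBin, counts)
#   timeBin counts
# 1       0      0
# 2      50      1
# 3     100      2
# 4     150      4
# 5     200      5
# 6     250      7
```

Unlike the cut-based version, this keeps the row for timeBin 0.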
I am very new to using R, so please go easy on me.
I am working with data from a survey that was administered twice to roughly the same group of respondents. Among other things, the survey asked respondents to list their height.
Before the survey was administered for a second round, some of the original wave-one respondents left the sample, and some new respondents arrived. When the survey was administered a second time, it began with a filter question that asked whether the respondent had taken the survey before. Respondents who took the first survey were not asked about their height on the second survey, but "new" respondents were asked about their height.
I am trying to create a variable that represents the height of all respondents who participated in the wave-two survey. Because respondents who took the wave-one survey have missing data for the wave-two height question, I need to replace these missing values with their values from the wave-one survey.
I realize this is probably an easy fix, but I am not sure how to do it. My data:
Height.W1 = A vector containing the height in feet for respondents who took the first survey.
Height.W2 = A similar variable for respondents who took the second survey.
Interview.Status = A variable indicating whether the respondent took the first survey. Let's say a value of "1" means the respondent took the first survey, and therefore has missing data for the Height.W2 variable.
How can I replace values for Height.W2 with values from Height.W1, conditional on whether Interview.Status==1?
Thanks in advance.
If I am understanding your question correctly:
Making some data to work with, some with Interview.Status == 1, and some with Interview.Status==0.
> df <- structure(list(Height.W1 = c(60, 62, 58, 64), Height.W2 = c(60, NA, 58, NA), Interview.Status = c(0, 1, 0, 1)), .Names = c("Height.W1", "Height.W2", "Interview.Status"), row.names = c(NA, 4L), class = "data.frame")
> df
Height.W1 Height.W2 Interview.Status
1 60 60 0
2 62 NA 1
3 58 58 0
4 64 NA 1
I subset by those with Interview.Status == 1 and replace Height.W2, which is NA, with Height.W1.
> df$Height.W2[df$Interview.Status == 1] <- df$Height.W1[df$Interview.Status == 1]
> df
Height.W1 Height.W2 Interview.Status
1 60 60 0
2 62 62 1
3 58 58 0
4 64 64 1
It is clear that Height.W2 has NA for Interview.Status==1, but it is not clear whether Height.W1 has NA for Interview.Status!=1. Assuming it does, a one-liner could be
Height <- apply(df[, c("Height.W1", "Height.W2")], 1, min, na.rm = TRUE)
or max, sum or any other function for that matter.
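If you would rather keep both original columns and build a separate combined variable, ifelse is another way to write the same condition. A short sketch, using the example data from the first answer (the column name Height is just a suggestion):

```r
df <- data.frame(
  Height.W1 = c(60, 62, 58, 64),
  Height.W2 = c(60, NA, 58, NA),
  Interview.Status = c(0, 1, 0, 1)
)

# Take the wave-one height for returning respondents, wave-two otherwise.
df$Height <- ifelse(df$Interview.Status == 1, df$Height.W1, df$Height.W2)
df$Height
# [1] 60 62 58 64
```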
What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.
You could face a combinatorial explosion. The simulation below enumerates every combination of 3 EEs from a set of 20 whose salaries are drawn with mean 60 and sd 20. Of the 1140 possible combinations, only 263 have a salary sum below 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
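A rough sketch of that sequential procedure is below. The function name sample_under_cap, its arguments and the max_tries limit are all made up for illustration, and it uses "<=" because the question asks for a sum that does not exceed the limit.

```r
# Hypothetical helper: draw k cases one at a time, restarting on a dead end.
sample_under_cap <- function(salaries, k, cap, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    chosen <- integer(0)
    budget <- cap
    repeat {
      # Cases that still fit in the remaining budget and are not yet chosen.
      candidates <- setdiff(which(salaries <= budget), chosen)
      if (length(candidates) == 0) break  # dead end: reject and restart
      # Index into candidates to avoid sample()'s behaviour on length-1 input.
      pick <- candidates[sample.int(length(candidates), 1)]
      chosen <- c(chosen, pick)
      budget <- budget - salaries[pick]
      if (length(chosen) == k) return(chosen)
    }
  }
  NULL  # no valid selection found within max_tries
}

set.seed(1)
salaries <- c(40, 55, 60, 70, 30, 90)
idx <- sample_under_cap(salaries, k = 3, cap = 150)
sum(salaries[idx]) <= 150
# [1] TRUE
```

It returns row indices, so the selected rows of a data frame would be something like directory[idx, ], where directory is your own data.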
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## simulate some salaries
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum
sx <- colSums(x)
sxc <- sx[sx<1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
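Note that the code above samples the sums rather than the groups themselves. If you need to know which cases were picked, one possible variation is to enumerate index triples instead. This is only a sketch, with a smaller N and a fixed seed chosen for the example:

```r
set.seed(42)
N <- 30
salary <- rnorm(N)

# Each column of idx holds the positions of one group of 3.
idx <- combn(N, 3)

# Sum of salaries for every group.
sums <- colSums(matrix(salary[idx], nrow = 3))

# Groups that satisfy the constraint.
ok <- which(sums < 1)

# Sample 10 valid groups with replacement; each column is one selected group.
picked <- idx[, ok[sample.int(length(ok), 10, replace = TRUE)], drop = FALSE]
dim(picked)
# [1]  3 10
```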