How can you weight data for splitting in R?

I want to split my data into a development and validation set. Data should be split by ID. For around 30% of my data individuals I have rich observations, with the remaining 70% having sparse data.
For my development set, I want to include all of the individuals with rich data (even if it might not be good practice to do so), and then fill up with individuals with sparse data. The validation set should not contain any rich data.
Some example data:
# A tibble: 6 x 4
ID CONC TIME RICH
<chr> <dbl> <dbl> <dbl>
1 A 55.0 1 1
2 A 52.6 2 1
3 A 50.2 3 1
4 A 47.9 4 1
5 E 40.7 2 0
6 E 38.3 2 0
I am aware of the sample() function, but I am at a loss at how to "randomly" split data with weights.
EDIT: All IDs have several observations, and so the randomization should be on the ID depending on RICH. An individual is assigned as having rich data if there are more than n observations.
EDIT 2: The 75%/25% split should be on IDs.

Here is one raw approach:
#Unique ID's
n <- unique(df$ID)
#Get all rich ID's
rich_set <- unique(df$ID[df$RICH == 1])
#count number of unique ID's in development set
development_n <- ceiling(length(n) * 0.75)
#select random Id's to complete development set
devel_ID <- sample(setdiff(n, rich_set), development_n - length(rich_set))
#Subset data
development_set <- subset(df, ID %in% c(rich_set, devel_ID))
validation_set <- subset(df, !ID %in% c(rich_set, devel_ID))
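Since the answer assumes an existing df, here is a self-contained sanity check on invented toy data (the IDs and values are made up, with A and B playing the "rich" individuals); setting a seed makes the random fill-up reproducible:

```r
# toy data in the shape of the question: A and B are the "rich" individuals
df <- data.frame(
  ID   = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  CONC = runif(12, 30, 60),
  TIME = rep(1:2, 6),
  RICH = rep(c(1, 1, 0, 0, 0, 0), each = 2)
)
set.seed(1)  # reproducible random fill-up
ids      <- unique(df$ID)
rich_set <- unique(df$ID[df$RICH == 1])
development_n <- ceiling(length(ids) * 0.75)
devel_ID <- sample(setdiff(ids, rich_set), development_n - length(rich_set))
development_set <- subset(df, ID %in% c(rich_set, devel_ID))
validation_set  <- subset(df, !ID %in% c(rich_set, devel_ID))
stopifnot(all(validation_set$RICH == 0))  # no rich individuals leak into validation
```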

How to randomize ascending list of values into two similar groups in R

I want to randomize an ascending list of values into two similar groups in R: two statistically similar groups, meaning the mean time (performance) is the same (the lower the time, the better). I would like evenly distributed groups with fast and slow skiers.
I will have a pre-test and want to randomize some alpine ski athletes based on their performance.
The dataset will look like this (this is a test dataset; the real one, with n = 40, will come from the pre-test):
# A tibble: 4 × 5
# Groups: BIB. [4]
BIB. `11` `99` `77` performance
<int> <dbl> <dbl> <dbl> <dbl>
1 1 14.2 NA NA NA
2 2 14.4 15.0 NA -0.600
3 3 14.3 14.6 NA -0.310
4 77 NA 12.9 61.4 NA
Can anyone help me?
My approach might be to sample it randomly some number of times (100?) then evaluate the means in each, and pick the smallest.
Here is how you would do that.
#load packages and invent some initial data
library(dplyr)
library(purrr)
(hiris <- head(iris, n = 20) |> select(perftime = Sepal.Length) |> mutate(id = row_number()))
# make 100 scrambled datasets, then evaluate them for closest mean
set.seed(42)
(nr <- nrow(hiris))
possible_sets <- map(1:100,
~slice_sample(.data = hiris,n = nr,replace=FALSE) |>
mutate(group=1*(row_number()<=nr/2)))
(evaluations <- map_dbl(possible_sets,~{
step1 <- group_by(.x,group) |> summarise(m=mean(perftime))
abs(step1$m[[1]]-step1$m[[2]])
}))
(set_to_choose <- which.min(evaluations))
#to see the evaluation
plot(seq_along(evaluations),evaluations)
points(x=set_to_choose,
y=evaluations[set_to_choose],
col="red",pch=15)
#to use the 'best' set
(chosen_set <- possible_sets[[set_to_choose]]) #use [[ to extract the data frame itself
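As an alternative to the random search, a deterministic "snake draft" (walk through the skiers from fastest to slowest and assign groups in the pattern 1, 2, 2, 1 repeating) tends to balance group means directly. The performance values here are made up:

```r
perf <- c(14.2, 14.4, 14.3, 12.9, 15.0, 14.6, 13.8, 14.9)  # hypothetical times
ord  <- order(perf)                  # indices from fastest to slowest
group <- integer(length(perf))
# snake draft: 1,2,2,1,1,2,2,1,... keeps fast and slow skiers spread across groups
group[ord] <- rep(c(1, 2, 2, 1), length.out = length(perf))
tapply(perf, group, mean)            # the two group means should be close
```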

Need to get total for Column in R

I have written the code up to this point, but I now need the total of the Score column in the rscores tibble.
library(tidyverse)
responses <- read_csv("responses.csv")
qformats <- read_csv("qformats.csv")
scoring <- read_csv("scoring.csv")
rlong <- gather(responses,Question, Response, Q1:Q10)
rlong_16 <- filter(rlong, Id == 16)
rlong2 <- inner_join(rlong_16, qformats, by = "Question")
rscores <- inner_join(rlong2, scoring)
What line of code do I add next to get the total for this column? I have been scratching my head for hours. Any help is appreciated :)
> head(rscores)
# A tibble: 6 x 5
Id Question Response QFormat Score
<dbl> <chr> <chr> <chr> <dbl>
1 16 Q1 Slightly Disagree F 0
2 16 Q2 Definitely Agree R 0
3 16 Q3 Slightly Disagree R 1
4 16 Q4 Definitely Disagree R 1
5 16 Q5 Slightly Agree R 0
6 16 Q6 Slightly Agree R 0
colSums() is overkill if you just need the sum of one column, and it will give you an error if any other column in the tibble/data.frame/etc. is not convertible to numeric. In your case, there's at least one character (chr) column that can't be summed. Typically you'd use rowSums or colSums on a matrix as opposed to a data frame.
Just use the sum function on the one column: sum(rscores$Score). Best of luck.
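Since the CSVs aren't reproducible here, a minimal sketch on a toy tibble with the same columns (the values are invented) showing both the base-R and the dplyr way:

```r
library(tibble)
library(dplyr)

# toy stand-in for rscores; Score values are made up
rscores <- tibble(
  Id       = 16,
  Question = paste0("Q", 1:6),
  Score    = c(0, 0, 1, 1, 0, 0)
)

sum(rscores$Score)                          # base R: total of one column
rscores %>% summarise(total = sum(Score))   # dplyr equivalent, returns a tibble
```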

Dummy variables based on values from different columns

I currently have a data frame that looks as such:
dat2 <- data.frame(
  ID = c(100, 101, 102, 103),
  DEGREE_1 = c("BA", "BA", "BA", "BA"),
  DEGREE_2 = c(NA, "BA", NA, NA),
  DEGREE_3 = c(NA, "MS", NA, NA),
  YEAR_DEGREE_1 = c(1980, 1990, 2000, 2004),
  YEAR_DEGREE_2 = c(NA, 1992, NA, NA),
  YEAR_DEGREE_3 = c(NA, 1996, NA, NA)
)
ID DEGREE_1 DEGREE_2 DEGREE_3 YEAR_DEGREE_1 YEAR_DEGREE_2 YEAR_DEGREE_3
100 BA <NA> <NA> 1980 NA NA
101 BA BA MS 1990 1992 1996
102 BA <NA> <NA> 2000 NA NA
103 BA <NA> <NA> 2004 NA NA
I would like to create dummy variables coded 0/1 based on what kind of degree was earned, using the completion of one BA degree as the base.
The completed data frame would have a second BA degree dummy, an MS degree dummy, and so on. For example, for ID 101, both dummies would have a value of 1. The completion of two MS degrees would not require a dummy, i.e. if someone completed two MS degrees, then the MS degree dummy would be 1 and there would be no dummy to signify completing two MS degrees.
This is a simple snapshot of a much bigger data frame that has many different degrees types besides BA and MS, so it isn't ideal for me to create if/else statements for every single degree type.
Any advice would be appreciated.
You could also include new columns and assign the value based on the DEGREE columns.
Including new columns, with all values equal 0:
dat2 <- cbind(dat2, BA_2nd = 0)
dat2 <- cbind(dat2, MS = 0)
Changing the value to 1, based on your conditions:
dat2$BA_2nd[!is.na(dat2$DEGREE_2)] <- 1
dat2$MS[!is.na(dat2$DEGREE_3) & dat2$DEGREE_3 == "MS"] <- 1
dat2
You can adapt it to all the conditions you have. This code generates only the output table that you included.
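Since the real data has many degree types, a reshaping approach avoids writing one condition per degree. Here is a sketch with dplyr/tidyr on a trimmed copy of the example (the BA_2nd name and the rule that only a second BA gets its own dummy follow the question; the rest of the column handling is an assumption):

```r
library(dplyr)
library(tidyr)

# trimmed toy data (YEAR columns omitted); the real data has many degree types
dat2 <- data.frame(
  ID       = c(100, 101, 102, 103),
  DEGREE_1 = c("BA", "BA", "BA", "BA"),
  DEGREE_2 = c(NA, "BA", NA, NA),
  DEGREE_3 = c(NA, "MS", NA, NA)
)

# count how many times each person earned each degree type
counts <- dat2 %>%
  pivot_longer(starts_with("DEGREE"), values_to = "degree") %>%
  filter(!is.na(degree)) %>%
  count(ID, degree)

# one dummy per degree type; a repeated BA becomes BA_2nd, repeats of other
# degrees collapse into a single dummy
dummies <- counts %>%
  mutate(dummy = case_when(
    degree == "BA" & n >= 2 ~ "BA_2nd",   # second BA beyond the base degree
    degree != "BA"          ~ degree,     # every other degree type
    TRUE                    ~ NA_character_
  )) %>%
  filter(!is.na(dummy)) %>%
  distinct(ID, dummy) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = dummy, values_from = value, values_fill = 0)

out <- left_join(dat2, dummies, by = "ID")
new_cols <- setdiff(names(dummies), "ID")
out[new_cols][is.na(out[new_cols])] <- 0  # ids with no extra degrees get 0
```

New degree types in the data automatically become new dummy columns, so nothing needs to be hard-coded.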

Check if a variable is time invariant in R

I tried to search for an answer to my question but only found the right answer for Stata (I am using R).
I am using a national survey to study which variables influence investment in a complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command in order to keep only the individuals present more than one time. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that time-invariant variables would be dropped, but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model="within", effect = "individual", data=df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high and medium refer to education level, and north and centre to geographical regions), summary(pan1) shows that the variable woman is still present.
At this point I think there are some mistakes in the survey (for example, sex was not entered correctly and so it wasn't the same for the same id), so I tried to find a way to check whether, for each id, sex is constant.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
the basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df has up to 40k observations.
If you know another way to check if a variable is constant in R I will be glad.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
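Since by returns a list-like object, tapply (a close relative that returns a named atomic vector) can make it easier to pull out the offending ids. A self-contained sketch on toy data (the values are invented; id 1 is deliberately inconsistent):

```r
# toy data with one transcription error: id 1 appears as both F and M
df <- data.frame(
  id  = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F")
)

# number of distinct sex values per id, as a named vector
n_sexes <- tapply(df$sex, df$id, function(x) length(unique(x)))

any(n_sexes > 1)                # is there any inconsistent id at all?
names(n_sexes)[n_sexes > 1]     # which ids are inconsistent
```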
We can try with dplyr
Example data:
df <- data.frame(
  year = c(2002, 2002, 2004, 2004, 2006, 2008, 2008, 2010),
  id = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F")
)
Id 1 is both F and M
library(dplyr)
df %>% group_by(id) %>% summarise(sexes = length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df %>% group_by(id) %>% summarise(sexes = length(unique(sex))) %>% filter(sexes == 2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2
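To go one step further and pull out the original rows for the inconsistent ids, n_distinct inside a grouped filter does it in one pipe (same toy data as above, rebuilt here so the snippet stands alone):

```r
library(dplyr)

df <- data.frame(
  id  = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F")
)

# keep only the rows belonging to ids whose recorded sex varies
flagged <- df %>%
  group_by(id) %>%
  filter(n_distinct(sex) > 1) %>%
  ungroup()
```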

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes on an identical value for an individual across different rows (situations).
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sample data as a data frame
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for which pid=1, and the next six rows, with pid=2, are observations for the second person.
This should work for you:
z <- replicate(100,
d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordier, but it deals with duplicated rows. replicate performs a given operation a set number of times (in the example below, 4). I then sample the unique values of pid (here 3 of them, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call("rbind", ...) and lapply handles the duplicates that the code above misses: instead of generating data frames with potentially different lengths, this generates a data frame for each sampled pid and then sticks them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid),3,replace=TRUE),
function(x) d[d$pid==x,])),
simplify=FALSE)
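A further variant of the EDIT's idea: if you later want to treat an individual drawn twice as two distinct panel units (which matters for clustered standard errors), you can relabel each sampled block. The boot_id column and the toy data below are my own additions for illustration:

```r
# toy panel: two individuals with six rows each, mirroring the question's layout
d <- data.frame(
  pid = rep(1:2, each = 6),
  x   = seq(10, 32, by = 2),
  y   = 2:13,
  z   = seq(-5, 0.5, by = 0.5)
)

set.seed(7)
boot_once <- function(d) {
  # sample individuals with replacement, one draw per individual in the panel
  ids <- sample(unique(d$pid), length(unique(d$pid)), replace = TRUE)
  # bind one copy of each sampled id, relabelling so repeats stay distinct
  do.call(rbind, lapply(seq_along(ids), function(i) {
    block <- d[d$pid == ids[i], ]
    block$boot_id <- i   # new id: a twice-sampled individual appears twice
    block
  }))
}

z <- replicate(4, boot_once(d), simplify = FALSE)
```

Every bootstrap data frame in z now has the same number of rows as d (here 12), since each draw contributes a full block.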