Assigning data frame column values probabilistically - r

I am trying to create a data frame named "students" with four variables: Gender, Year (Freshman, Sophomore, Junior, Senior), Age, and GPA. The idea is to have a data frame that illustrates the four levels of measurement: nominal, ordinal, interval, and ratio.
At this point it looks something like this:
ID Gender Year Age GPA
1 Male Sophomore 0 3.9
2 Male Junior 0 3.3
3 Female Junior 0 3.6
4 Male Freshman 0 3.1
5 Female Senior 0 2.9
I'm having a problem with Age. I would like Age to be assigned based on a probability. For example, if a student is a Freshman, I'd like Age to be assigned along something like the following lines:
Age Probability
14 .47
15 .48
16 .05
I have a function to do that set up like this:
1: Age <- function(df) {
2: for (i in 1:nrow(df)) {
3: if (df[i, 2] == "Freshman") {
4: df[i, 3] = 15
5: } else if {
6: continue through the years
7: }
8: }
9: }
My thinking is that I want to change the right side of the assignment in Line 4 to something that will assign the age probabilistically. That's what I cannot figure out how to do.
On a related note, if there's a better way to do it than what I'm considering, I'd be appreciative of hearing that.
And on a final note, I've Googled the web at large, queried the R forums on Reddit and Talk Stats, and searched the R tags on this site, all to no avail. I can't believe I'm the first person who's ever wanted to do something like this, so it occurs to me that maybe I'm phrasing the query wrong. If that's the case, any guidance there would also be appreciated.

Use the sample function like this:
sample(14:16, size=1, prob=c(0.47, 0.48, 0.05))
## [1] 14
sample(14:16, size=10, replace=TRUE, prob=c(0.47, 0.48, 0.05))
## [1] 14 14 15 14 15 16 15 15 15 15
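To fill the whole Age column rather than one value at a time, one option is to draw all the ages for a given Year in a single sample call. The sketch below is only an illustration: age_dist is a hypothetical lookup list, and only the Freshman probabilities come from the question; the other years reuse the same shape shifted by one year as placeholders.

```r
set.seed(42)  # only so the illustration is reproducible

students <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female"),
  Year   = c("Sophomore", "Junior", "Junior", "Freshman", "Senior"),
  GPA    = c(3.9, 3.3, 3.6, 3.1, 2.9),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: possible ages and their probabilities per Year.
# Only the Freshman entry is taken from the question; the rest are placeholders.
age_dist <- list(
  Freshman  = list(ages = 14:16, prob = c(0.47, 0.48, 0.05)),
  Sophomore = list(ages = 15:17, prob = c(0.47, 0.48, 0.05)),
  Junior    = list(ages = 16:18, prob = c(0.47, 0.48, 0.05)),
  Senior    = list(ages = 17:19, prob = c(0.47, 0.48, 0.05))
)

students$Age <- NA_integer_
for (yr in names(age_dist)) {
  idx <- students$Year == yr   # rows belonging to this Year
  d   <- age_dist[[yr]]
  # one draw per matching row, with replacement
  students$Age[idx] <- sample(d$ages, size = sum(idx), replace = TRUE, prob = d$prob)
}

students
```

The loop runs once per Year level rather than once per row, so it stays short even for a large data frame.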

Related

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the entire data frame. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
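The loop above can also be written without explicit iteration in base R. The following is a minimal sketch, not a tuned solution: it rebuilds the data frames from the question, and tol is a hypothetical tolerance (in the same units as the times) that could be set to 2 for the "within 2 seconds" variant. Note that outer builds a full possible-by-truth matrix, so memory use grows with the product of the two row counts.

```r
truth <- data.frame(
  ID    = 1:5,
  Start = c(213466, 32238, 218934, 222774, 68137),
  Stop  = c(213468, 32240, 218936, 222776, 68139)
)
possible <- data.frame(
  ID    = 1:8,
  Times = c(32239.76, 32241.14, 68138.72, 111233.93,
            128395.28, 146180.31, 188433.35, 198714.7)
)

tol <- 0  # hypothetical tolerance; e.g. 2 to accept times within 2 seconds

# hits[i, j] is TRUE when possible$Times[i] falls inside interval j of truth
hits <- outer(possible$Times, truth$Start - tol, ">=") &
        outer(possible$Times, truth$Stop  + tol, "<=")

possible$Between <- as.integer(rowSums(hits) > 0)  # time matched any interval
truth$Match      <- as.integer(colSums(hits) > 0)  # interval matched any time

possible
truth
```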

Check if a variable is time invariant in R

I searched for an answer to my question but only found one for Stata (I am using R).
I am using a national survey to study which variables influence investment in a complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command to keep only the individuals who appear more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected the constant variables to be dropped, but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model="within", effect = "individual", data=df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high and medium refer to education level, and north and centre to geographical regions) and calling summary(pan1), the variable woman is still present.
At this point I think there are some mistakes in the survey (for example, sex was not entered correctly, so it is not the same across rows for the same id), so I tried to find a way to check whether sex is constant for each id.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
The basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df is composed up to 40k observations.
If you know another way to check if a variable is constant in R I will be glad.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
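As a small illustration of the "which ones" step, here is a sketch on made-up data in which id 1 is recorded with both sexes. It uses tapply in place of by, because tapply returns a plain named vector that is easy to index; n_sex and bad_ids are hypothetical names.

```r
# Made-up data: id 1 appears as both "F" and "M"
df <- data.frame(
  id  = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# Number of distinct sex values per id
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))

any(n_sex > 1)                       # is there any inconsistent id at all?
bad_ids <- names(n_sex)[n_sex > 1]   # ids with more than one recorded sex
bad_ids

# Inspect the offending rows
df[df$id %in% as.numeric(bad_ids), ]
```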
We can try with dplyr
Example data:
df=data.frame(year=c(2002,2002,2004,2004,2006,2008,2008,2010),
id=c(1,2,1,2,3,3,4,4),
sex=c("F","M","M","M","M","M","F","F"))
Id 1 is both F and M
library(dplyr)
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))%>%filter(sexes==2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2

R - create new vectors based on elements of existing vector

Thanks in advance for your help. I am very new to R and am having some trouble with code that, to me, looks like it should work but doesn't. I have a data frame like the one below:
studentID classNumber classRating
7 1 4
7 2 4
7 4 3
79 1 5
79 2 3
116 1 5
116 2 4
134 1 5
134 3 5
134 4 5
And I want it to read like this:
Student ID class1 class2 class3 class4
7 4 4 NA 3
79 5 3 NA NA
116 5 4 NA NA
134 5 NA 5 5
I've tried to piece together different things that I've come across and it seemed like the best approach was to create a new data frame and matrix and then populate it from the current data frame. I came up with the broken code below:
classRatings = data.frame(matrix(NA,4,5))
for(i in 1:nrow(classDB)){
#Find ratings by each student
rowsToReplace = classDB$studentID==classRatings$studentID[i]
#Make a row for each unique studentID in classRatings
classDB$studentID[rowsToReplace] = classRatings$studentID[i]
#for each studentID, find put the given rating for each unique class into
#it's own vector
for(j in classDB$classNumber){
if(classDB$classNumber==1){classRatings$class1==classDB$classRating}[j]
if(classDB$classNumber==2){classRatings$class2==classDB$classRating}[j]
if(classDB$classNumber==3){classRatings$class3==classDB$classRating}[j]
if(classDB$classNumber==4){classRatings$class4==classDB$classRating}[j]
if(classDB$classNumber==5){classRatings$class5==classDB$classRating}[j]
}
}
I'm getting an error that says:
the condition has length > 1 and only the first element will be used
and I am beyond my skill level to figure it out. Any help is appreciated.
The tidyr package can spread this long table into a wider one:
library(tidyr)
spread(classDB,classNumber,classRating,fill=NA)
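If adding a package is not an option, roughly the same reshaping can be sketched in base R with tapply. This is only one possible approach; it assumes each student rated each class at most once, and the class1 to class4 column names are built to match the layout asked for in the question. (In newer tidyr releases, spread has been superseded by pivot_wider, which may be worth checking as well.)

```r
classDB <- data.frame(
  studentID   = c(7, 7, 7, 79, 79, 116, 116, 134, 134, 134),
  classNumber = c(1, 2, 4, 1, 2, 1, 2, 1, 3, 4),
  classRating = c(4, 4, 3, 5, 3, 5, 4, 5, 5, 5)
)

# Cross-tabulate ratings by student (rows) and class (columns);
# combinations that never occur come out as NA.
wide <- with(classDB,
             tapply(classRating, list(studentID, classNumber),
                    FUN = function(x) x[1]))

colnames(wide) <- paste0("class", colnames(wide))

classRatings <- data.frame(
  studentID = as.numeric(rownames(wide)),
  wide,
  row.names = NULL
)
classRatings
```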

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
I need to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with its fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
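The same lookup can be sketched in base R with match, which may be enough if the data is not large. The column names code and num below are hypothetical stand-ins for the two columns of the reference table; cells that do not match any code (including empty strings) become NA, as in the output above.

```r
# Reference table: alphabetical code -> numeric identifier
ref <- data.frame(
  code = c("ZYO", "KAO", "MKU", "SAG"),
  num  = c(64, 24, 32, 42),
  stringsAsFactors = FALSE
)

# Small stand-in for the main data frame
g3 <- data.frame(
  SAG = c("", "SAG", "", "SAG"),
  KAO = c("KAO", "KAO", "", ""),
  stringsAsFactors = FALSE
)

# For every column, look each cell up in ref$code and take the matching number.
# Assigning into g3[] keeps g3 a data frame with the same column names.
g3[] <- lapply(g3, function(x) ref$num[match(x, ref$code)])
g3
```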

Select consecutive date entries

I have updated the question, as a) I did not articulate it clearly on the first attempt, and b) my exact need has also shifted somewhat.
I want to especially thank Hemmo for great help so far, and apologise for not articulating my question clearly enough to him. His code (which addressed an earlier version of the problem) is shown in the answer section.
At a high level, I am looking for code that helps to identify and differentiate the blocks of consecutive free time of different individuals. More specifically, the code would ideally:
Check whether an activity is labelled as "Free".
Check whether consecutive weeks (week earlier, week later) of time spent by the same person were also labelled as "Free".
Give the entire block of consecutive weeks of that person that are labelled "Free" an indicator in the desired outcome column. Note that the length of the time periods (e.g. 1 consecutive week, 4 consecutive weeks, 8 consecutive weeks) will vary.
Finally, due to a need for further analysis on the characteristics of these clusters, different blocks should receive different indicators (e.g. the March block of Paul would have value 1, the May block value 2, and Kim's block in March would have value 3).
Hopefully this becomes clearer when one looks at the example data frame (see the desired final column).
Any help much appreciated, code for the test dataframe per below.
Many thanks in advance,
W
Example (note that the last column should be generated by the code, purely included as illustration):
Week Name Activity Hours Desired_Outcome
1 01/01/2013 Paul Free 40 1
2 08/01/2013 Paul Free 10 1
3 08/01/2013 Paul Project A 30 0
4 15/01/2013 Paul Project B 30 0
5 15/01/2013 Paul Project A 10 0
6 22/01/2013 Paul Free 40 2
7 29/01/2013 Paul Project B 40 0
8 05/02/2013 Paul Free 40 3
9 12/02/2013 Paul Free 10 3
10 19/02/2013 Paul Free 30 3
11 01/01/2013 Kim Project E 40 0
12 08/01/2013 Kim Free 40 4
13 15/01/2013 Kim Free 40 4
14 22/01/2013 Kim Project E 40 0
15 29/01/2013 Kim Free 40 5
Code for dataframe:
Name=c(rep("Paul",10),rep("Kim",5))
Week=c("01/01/2013","08/01/2013","08/01/2013","15/01/2013","15/01/2013","22/01/2013","29/01/2013","05/02/2013","12/02/2013","19/02/2013","01/01/2013","08/01/2013","15/01/2013","22/01/2013","29/01/2013")
Activity=c("Free","Free","Project A","Project B","Project A","Free","Project B","Free","Free","Free","Project E","Free","Free","Project E","Free")
Hours=c(40,10,30,30,10,40,40,40,10,30,40,40,40,40,40)
Desired_Outcome=c(1,1,0,0,0,2,0,3,3,3,0,4,4,0,5)
df=data.frame(Week,Name,Activity,Hours,Desired_Outcome)
df
EDIT: This was messy already as the question was edited several times, so I removed old answers.
checkFree<-function(df){
df$Week<-as.Date(df$Week,format="%d/%m/%Y")
df$outcome<-numeric(nrow(df))
if(df$Activity[1]=="Free"){ #check first
counter<-1
df$outcome[1]<-counter
} else counter<-0
for(i in 2:nrow(df)){
if(df$Activity[i]=="Free"){
LastWeek <- (df$Week >= (df$Week[i]-7) &
df$Week < (df$Week[i]))
if(all(df$Activity[LastWeek]!="Free"))
counter<-counter+1
df$outcome[i]<-counter
}
}
df
}
splitdf<-split(df, Name)
df<-unsplit(lapply(splitdf,checkFree),Name)
uniqs<-unique(df$Name) #for renumbering
for(i in 2:length(uniqs))
df$outcome[df$Name==uniqs[i] & df$outcome>0]<-
max(df$outcome[df$Name==uniqs[i-1]]) +
df$outcome[df$Name==uniqs[i] & df$outcome>0]
df
That should do it, although the above code is probably far from optimal.
Using the comment by user1885116 to Hemmo's answer as a guide to what is desired, here is a somewhat simpler approach:
N <- 1
x <- with(df, df[Activity=='Free',])
y <- with(x, diff(Week)) <= N*7
df$outcome <- 0
df[rownames(x[c(y, FALSE) | c(FALSE, y),]),]$outcome <- 1
df
## Week Activity Hours Desired_Outcome outcome
## 1 2013-01-01 Project A 40 0 0
## 2 2013-01-08 Project A 10 0 0
## 3 2013-01-08 Free 30 1 1
## 4 2013-01-15 Project B 30 0 0
## 5 2013-01-15 Free 10 1 1
## 6 2013-01-22 Project B 40 0 0
## 7 2013-01-29 Free 40 0 0
## 8 2013-02-05 Project C 40 0 0
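For the updated version of the question (separately numbered blocks per person), one possible sketch is to look only at the "Free" rows and start a new block whenever the person changes or more than seven days have passed since the previous "Free" week. It assumes the rows are already sorted by person and then by week, as in the example data; free, gap, new_person and starts are hypothetical helper names.

```r
df <- data.frame(
  Week = as.Date(c("01/01/2013", "08/01/2013", "08/01/2013", "15/01/2013",
                   "15/01/2013", "22/01/2013", "29/01/2013", "05/02/2013",
                   "12/02/2013", "19/02/2013", "01/01/2013", "08/01/2013",
                   "15/01/2013", "22/01/2013", "29/01/2013"),
                 format = "%d/%m/%Y"),
  Name = c(rep("Paul", 10), rep("Kim", 5)),
  Activity = c("Free", "Free", "Project A", "Project B", "Project A", "Free",
               "Project B", "Free", "Free", "Free", "Project E", "Free",
               "Free", "Project E", "Free"),
  stringsAsFactors = FALSE
)

free <- which(df$Activity == "Free")          # row numbers of "Free" entries
nm   <- df$Name[free]

# Days since the previous "Free" row (Inf for the very first one)
gap <- c(Inf, diff(as.numeric(df$Week[free])))
# TRUE where the person differs from the previous "Free" row
new_person <- c(TRUE, nm[-1] != nm[-length(nm)])

# A block starts with a new person or after a gap of more than one week
starts <- new_person | gap > 7

df$outcome <- 0
df$outcome[free] <- cumsum(starts)            # running block number
df
```

On the example data this reproduces the Desired_Outcome column from the question.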

Resources