R creating variable with satisfying condition - r

Help sought from anyone.
I have a household survey data set named h2004 and would like to create a variable equals to another variable that satisfy certain condition. Here I have put a sample of observations.
cq15 expen
10 0.4616136
10 1.538712
11 2.308068
11 0.384678
12 2.576797822
12 5.5393632
13 5.4624276
14 2.6158104
14 20.157127
and I tried the following command:
h2004$crops[h2004$cq15>=12 & h2004$cq15<=14]=h2004$expen
and this produces wrong results in R as I know the correct result from using Stata. In the original data set, the above command takes values of 'expen' even when cq15<12 and replaces the observations against cq15>=12 & cq15<=14.
I also tried with filter option of dplyr that correctly subset the data frame but don't know how to apply it to specific variable.
fil<- filter(h2004, cq15>=12 & cq15<=14)
I think my subsetting (cq15>=12 & cq15<=14) is wrong. Please advice. Thanks

The problem is in the command. When the command is executed, the following warning message is issued:
Warning message:
In h2004$crops[h2004$cq15 >= 12 & h2004$cq15 <= 14] = h2004$expen :
number of items to replace is not a multiple of replacement length
The reason for this is that the LHS of this command selects elements satisfying condition h2004$cq15 >= 12 & h2004$cq15 <= 14 while on the RHS, the complete vector h2004$expen is given causing mismatch in length.
Solution:
> h2004$crops[h2004$cq15>=12 & h2004$cq15<=14]=h2004$expen[h2004$cq15>=12 & h2004$cq15<=14]
> h2004
cq15 expen crops
1 10 0.4616136 NA
2 10 1.5387120 NA
3 11 2.3080680 NA
4 11 0.3846780 NA
5 12 2.5767978 2.576798
6 12 5.5393632 5.539363
7 13 5.4624276 5.462428
8 14 2.6158104 2.615810
9 14 20.1571270 20.157127
or Alternatively:
> indices <- which(h2004$cq15>=12 & h2004$cq15<=14)
> h2004$crops[indices] = h2004$expen[indices]
> h2004
cq15 expen crops
1 10 0.4616136 NA
2 10 1.5387120 NA
3 11 2.3080680 NA
4 11 0.3846780 NA
5 12 2.5767978 2.576798
6 12 5.5393632 5.539363
7 13 5.4624276 5.462428
8 14 2.6158104 2.615810
9 14 20.1571270 20.157127

Related

How do I create a column using values of a second column that meet the conditions of a third in R?

I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset, and if the age at onset of MDD < the onset of OUD, it equals 1, and if the opposite is true, then it equals 2. I also have another column PhysDis that has values 0-100 (numeric in nature).
What I want to do is make a new column that includes the values of PhysDis, but only if MDDOnset == 1, and another if MDDOnset==2. I want to make these columns so that I can run a t-test on them and compare the two groups (those with MDD prior OUD, and those who had MDD after OUD with regards to which group has a greater physical disability score). I want any case where MDDOnset is not 1 to be NA.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I did the t test twice, once where MDDOnset = 1 and another when it equaled 2, the mean for y (Comorbidity$PhysDis) was the same, and when I looked into the original csv file, it turned out that this mean was the mean of the entire column, and not just cases where MDDOnset had a value of one or two. If there is a different way to run the t-tests that would have the mean of PhysDis only when MDDOnset = 1, and another with the mean of PhysDis only when MDDOnset == 2 that does not require making new columns, then please tell me.. Sorry if there are any similar questions or if my approach is way off, I'm new to R and programming in general, and thanks in advance.
Here's a smaller data frame where I tried to replicate the error where the new columns have switched lengths. The issue would be that the length of C would be 4, and the length of D would be 6 if I could replicate the error.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
A B C D
1 7 25 NA 25
2 9 34 NA 34
3 4 14 14 NA
4 2 76 76 NA
5 5 56 56 NA
6 10 34 NA 34
7 8 23 NA 23
8 6 12 12 NA
9 1 89 89 NA
10 3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
I would use the mutate command from the dplyr package.
Comorbidity <- mutate(Comorbidity, newColumn = (ifelse(MDDOnset == 1, PhysDis, "")), newColumn2 = (ifelse(MDDOnset == 2, PhysDis, "")))

chaining together sequential observations with only current and immediately prior ID values in R

Say I have some data on traits of individuals measured over time, that looks like this:
present <- c(1:4)
pre.1 <- c(5:8)
pre.2 <- c(9:12)
present2 <- c(13:16)
id <- c(present,pre.1,pre.2,present2)
prev.id <- c(pre.1,pre.2,rep(NA,8))
trait <- rnorm(16,10,3)
d <- data.frame(id,prev.id,trait)
print d:
id prev.id trait
1 1 5 10.693266
2 2 6 12.059654
3 3 7 3.594182
4 4 8 14.411477
5 5 9 10.840814
6 6 10 13.712924
7 7 11 11.258689
8 8 12 10.920899
9 9 NA 14.663039
10 10 NA 5.117289
11 11 NA 8.866973
12 12 NA 15.508879
13 13 NA 14.307738
14 14 NA 15.616640
15 15 NA 10.275843
16 16 NA 12.443139
Every observations has a unique value of id. However, some individuals have been observed in the past, and so I also have an observation of prev.id. This allows me to connect an individual with its current and past values of trait. However, some individuals have been remeasured multiple times. Observations 1-4 have previous IDs of 5-8, and observations of 5-8 have previous IDs of 9-12. Observations 9-12 have no previous ID because this is the first time these were measured. Furthermore, observations 13-16 have never been measured before. So, observations 1:4 are unique individuals, observations 5-12 are prior observations of individuals 1-4, and observations 13-16 are another set of unqiue individuals, distinct from 1-4. I would like to write code to generate a table that has every unique individual, as well as every past observation of that individuals trait. The final output would look like:
id <- c(1:4,13:16)
prev.id <- c(5:8, rep(NA,4))
trait <- d$trait[c(1:4,13:16)]
prev.trait.1 <- d$trait[c(5:8 ,rep(NA,4))]
prev.trait.2 <- d$trait[c(9:12,rep(NA,4))]
output<- data.frame(id,prev.id,trait,prev.trait.1,prev.trait.2)
> output
id prev.id trait prev.trait.1 prev.trait.2
1 1 5 10.693266 10.84081 14.663039
2 2 6 12.059654 13.71292 5.117289
3 3 7 3.594182 11.25869 8.866973
4 4 8 14.411477 10.92090 15.508879
5 13 NA 14.307738 NA NA
6 14 NA 15.616640 NA NA
7 15 NA 10.275843 NA NA
8 16 NA 12.443139 NA NA
I can accomplish this in a straightforward manner, but it requires me coding an additional pairing for each previous observation, such that the number of code groups I need to write is the number of times any individual has been recorded. This is a pain, as in the data set I am applying this problem to, there may be anywhere from 0-100 previous observations of an individual.
#first pairing
d.prev <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev) <- c('prev.id','prev.trait.1','prev.id.2')
d <- merge(d,d.prev, by = 'prev.id',all.x=T)
#second pairing
d.prev2 <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev2) <- c('prev.id.2','prev.trait.2','prev.id.3')
d<- merge(d,d.prev2,by='prev.id.2',all.x=T)
#remove observations that are another individuals previous observation
d <- d[!(d$id %in% d$prev.id),]
How can I go about doing this in fewer lines, so I don't need 100 code chunks to cover individuals that have been remeasured 100 times?
What you have is a forest of linear lists. We'll start at the terminal ends
roots<-d$id[is.na(d$prev.id)]
And determine the paths backwards
path <- function(node) {
a <- integer(nrow(d))
i <- 0
while(!is.na(node)) {
i <- i+1
a[i] <- node
node <- d$id[match(node,d$prev.id)]
}
return(rev(a[1:i]))
}
Then we can get a 'stacked' representation of your desired output with
x<-do.call(rbind,lapply(roots,
function(r) {p<-path(r); data.frame(id=p[[1]],seq=seq_along(p),traits=d$trait[p])}))
And then use reshape2::dcast to get it in the desired shape
library(reshape2)
dcast(x,id~seq,fill=NA,value.var='traits')
id 1 2 3
1 1 10.693266 10.84081 14.663039
2 2 12.059654 13.71292 5.117289
3 3 3.594182 11.25869 8.866973
4 4 14.411477 10.92090 15.508879
5 13 14.307738 NA NA
6 14 15.616640 NA NA
7 15 10.275843 NA NA
8 16 12.443139 NA NA
I leave it to you to adapt column names.

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpted section of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id(id number of respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
recruit.id=c(-1,1,1,2,2,4,5,3,8,3),
netsize=c(6,6,6,5,5,4,4,3,4,6), population=rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1,0 or their own id number but I still get the same message. I have also tried eliminating function arguments (coupon.max, population) or deleting seed respondents, but I still get the same message.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
This gives a result with a warning
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion

looping over the name of the columns in R for creating new columns

I am trying to use the loop over the column names of the existing dataframe and then create new columns based on one of the old column.Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 20 0
10 13 0
Now, I am trying to use for loop to generate two variables x1.imp and x2.imp that have values related to D=0 when D=1 and values related to D=1 when D=0(Here I actually don't need for loop but for my original dataset with large cols (variables), I really need the loop) based on the following condition:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't give the names of new cols as imp.x2 and imp.x3
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know the solution without loop [here]. I want with loop.
Expected output:
x1 x2 D x1.impt x2.imp
10 NA 1 7 10
12 NA 1 9 20
17 NA 1 10 13
7 10 0 10 NA
9 20 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0],get(i)[D==1]))
}
A few comments:
replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2]
the $ is for interactive usage. Instead, when programming, i.e. when the column name is to be interpreted, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]]
inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
you should not name your data.frame sample as it is also the name of a pretty common function
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
newvar <- paste(.name, "impt", sep=".")
sample[[newvar]] <- ifelse(test, sample[!test, .name],
sample[test, .name])
}
sample

Creating a numerical variable order

I have a set of data with 3 columns: index column (with no name), colour, colour of seed, and germination time.
How do I create a numerical variable called 'order' with values 1 to 22 (the number of data sets)?
I don't know if I get you right, but simplest way would be:
> order <- c(1:22)
> order
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
No, if you run:
class(order)
you will get:
[1] "integer"
but you can easily get every element of object order (especially in a loop)
for(i in 1:length(order)){
print(order[i])
}

Resources