get the paired sample in R language [closed] - r
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
X<-scan()
1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1
1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
Z<-scan()
-0.05 0.11 -0.01 1.08 0.68 -1.79 -0.12 -0.06 0.17 -1.35 1.55 0.60
-1.42 -1.21 0.97 0.23 0.20 0.89 0.28 0.56 1.02 -0.32 0.20 -1.35
0.53 -0.52 -0.07 -1.07 0.10 0.53 0.97 0.32 -0.07 0.98 -1.23 0.72
-0.09 0.31 1.25 0.60 1.16 -0.98 1.63 0.72 0.24 -0.02 -1.13 0.56
0.78 1.75 -0.01 -0.44 0.47 -0.21 2.06 2.19 -0.94 -0.36 1.35 -1.35
1.50 0.13 -0.20 -0.57 -0.14 -1.34 -1.17 2.04 0.21 1.47 -1.20 -0.60
0.15 -0.64 -0.71 0.24 -0.86 -1.39 -0.63 -1.25 0.40 -0.76 0.73 -0.15
0.09 0.35 -0.19 0.29 0.56 0.82 -0.28 0.63 1.35 -0.04 1.99 1.12
-1.91 0.26 -1.18 -0.10
In the vector X, 0 is the control group and 1 is the case group.
I want to match these cases and controls based on the Z vector. That is, I want to match the elements of X based on Z and get the samples from the matched data.
What should I do?
The other answers seem to think that you're looking for subsetting, but I'm assuming (based on your use of the language "case" and "controls") that you're talking about matching in a statistical sense. If so, it sounds like you want something like the functionality provided by the Matching package, like the following:
library(Matching)
out <- Match(Tr=X,X=Z)
out$mdata # list of `Y` outcome vector (if applicable),
# `Tr` treatment vector, and
# `X` matrix of covariates for the matched sample
If you also have an outcome measure, you can specify that in Match and it will give you treatment effect estimates.
There are also other packages to do matching, like MatchIt, cem, and nonrandom (the last of which has apparently been removed from CRAN), depending on what particular matching procedure you're going for.
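To make the idea of statistical matching concrete without any package, here is a minimal base-R sketch of 1:1 nearest-neighbour matching with replacement on a single covariate. It is only an illustration on made-up toy values, not a reimplementation of Match() (which also handles ties, calipers and estimation); the names x, z, nearest_control and matched are hypothetical.

```r
# Toy data (made up for illustration): 1 = case, 0 = control
x <- c(1, 1, 0, 1, 0, 1, 0)
z <- c(0.10, -0.50, 0.00, 1.20, 1.00, 0.40, -0.60)

cases    <- which(x == 1)
controls <- which(x == 0)

# For each case, pick the control whose z is closest (with replacement,
# so one control may be reused for several cases)
nearest_control <- sapply(cases, function(i) {
  controls[which.min(abs(z[controls] - z[i]))]
})

# Matched sample: one row per case with its matched control
matched <- data.frame(
  case      = cases,
  control   = nearest_control,
  z_case    = z[cases],
  z_control = z[nearest_control]
)
matched
```

Matching with replacement keeps every case, which matters here because the controls are the smaller group.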
I suppose you are looking for
Z[as.logical(X)] # case
and
Z[!X] # control
I suppose your question is about subsetting; here are some examples:
# Data
X<-c(1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1,0,0,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1)
Z<-c(-0.05,0.11,-0.01,1.08,0.68,-1.79,-0.12,-0.06,0.17,-1.35,1.55,0.60,-1.42,-1.21,0.97,0.23,0.20,0.89,0.28,0.56,1.02,-0.32,0.20,-1.35,0.53,-0.52,-0.07,-1.07,0.10,0.53,0.97,0.32,-0.07,0.98,-1.23,0.72,-0.09,0.31,1.25,0.60,1.16,-0.98,1.63,0.72,0.24,-0.02,-1.13,0.56,0.78,1.75,-0.01,-0.44,0.47,-0.21,2.06,2.19,-0.94,-0.36,1.35,-1.35,1.50,0.13,-0.20,-0.57,-0.14,-1.34,-1.17,2.04,0.21,1.47,-1.20,-0.60,0.15,-0.64,-0.71,0.24,-0.86,-1.39,-0.63,-1.25,0.40,-0.76,0.73,-0.15,0.09,0.35,-0.19,0.29,0.56,0.82,-0.28,0.63,1.35,-0.04,1.99,1.12,-1.91,0.26,-1.18,-0.10)
myMatrix <- cbind(X,Z)
# Subsetting
myMatrixControls <- myMatrix[ myMatrix[,1]==0,]
myMatrixCases <- myMatrix[ myMatrix[,1]==1,]
# Example: get sum per group
sumZ_Controls <- sum(myMatrix[ myMatrix[,1]==0, 2])
sumZ_Cases <- sum(myMatrix[ myMatrix[,1]==1, 2])
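The per-group sums can also be computed in one call with tapply, which avoids repeating the subsetting expression for each group. A small sketch on shortened toy vectors (the values are made up, not the data from the question):

```r
# Toy data: X is the group indicator (0 = control, 1 = case), Z the covariate
X <- c(1, 1, 0, 1, 0)
Z <- c(0.5, -0.2, 1.0, 0.3, -0.5)

# Sum of Z within each level of X; result is named "0" and "1"
group_sums <- tapply(Z, X, sum)
group_sums

# split() returns the Z values themselves, one vector per group
split(Z, X)
```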
Related
What could be the R code for Making an index?
Suppose we run a regression with dummy variables and we obtain results like Yt=a+b1X1+b2X2+b3X3...+bnXn+C1Dum1+C2Dum2+C3Dum3+...+CnDumN I want to create an index I, such that, I=W1*Dum1+W2*Dum2+W3*Dum3+...+Wn*DumN where Wi's are the weights as the regression coefficients of dummies.
Suppose you have a data frame like this.
head(dat)
#      X1    X2     Y D1 D2 D3    W1    W2    W3
# 1  1.37 -0.31  0.21  0  1  0 -1.04  1.39 -0.03
# 2 -0.56 -1.78 -0.36  0  1  0 -0.09 -0.48  0.11
# 3  0.36 -0.17  0.76  0  1  0  0.62  0.65 -0.49
# 4  0.63  1.21 -0.73  0  1  0 -0.95  1.39 -0.50
# 5  0.40  1.90 -1.37  0  1  0 -0.54 -1.11 -1.66
# 6 -0.11 -0.43  0.43  0  1  0  0.58 -0.86 -0.38
You may use which.max to subset.
# weights and dummies are not defined in the post; presumably they hold the
# column names, e.g. weights <- paste0("W", 1:3); dummies <- paste0("D", 1:3)
dat <- transform(dat, weight=apply(dat, 1, function(x) x[weights][which.max(x[dummies])]))
dat
#       X1    X2     Y D1 D2 D3    W1    W2    W3 weight
# 1   1.37 -0.31  0.21  0  1  0 -1.04  1.39 -0.03   1.39
# 2  -0.56 -1.78 -0.36  0  1  0 -0.09 -0.48  0.11  -0.48
# 3   0.36 -0.17  0.76  0  1  0  0.62  0.65 -0.49   0.65
# 4   0.63  1.21 -0.73  0  1  0 -0.95  1.39 -0.50   1.39
# 5   0.40  1.90 -1.37  0  1  0 -0.54 -1.11 -1.66  -1.11
# 6  -0.11 -0.43  0.43  0  1  0  0.58 -0.86 -0.38  -0.86
# 7   1.51 -0.26 -0.81  0  0  1  0.77 -1.13 -0.51  -0.51
# 8  -0.09 -1.76  1.44  0  0  1  0.46 -1.46  2.70   2.70
# 9   2.02  0.46 -0.43  1  0  0 -0.89  0.08 -1.36  -0.89
# 10 -0.06 -0.64  0.66  0  1  0 -1.10  0.65  0.14   0.65
# 11  1.30  0.46  0.32  0  0  1  1.51  1.20 -1.49  -1.49
# 12  2.29  0.70 -0.78  0  1  0  0.26  1.04 -1.47   1.04
# 13 -1.39  1.04  1.58  0  1  0  0.09 -1.00  0.12  -1.00
# 14 -0.28 -0.61  0.64  0  0  1 -0.12  1.85 -1.00  -1.00
# 15 -0.13  0.50  0.09  0  0  1 -1.19 -0.67  0.00   0.00
# 16  0.64 -1.72  0.28  0  1  0  0.61  0.11 -0.43   0.11
# 17 -0.28 -0.78  0.68  0  0  1 -0.22 -0.42 -0.61  -0.61
# 18 -2.66 -0.85  0.09  1  0  0 -0.18 -0.12 -2.02  -0.18
# 19 -2.44 -2.41 -2.99  0  0  1  0.93  0.19 -1.22  -1.22
# 20  1.32  0.04  0.28  0  1  0  0.82  0.12  0.18   0.12
Data: I show here how I created the example data, which probably answers additional questions.
set.seed(42)
dat <- data.frame(matrix(round(rnorm(60), 2), 20, 3))
dat$X4 <- rbinom(20, 2, .5)
names(dat)[3] <- "Y"
dat <- cbind(dat[-4], setNames(data.frame(model.matrix(X1 ~ 0 + factor(X4), dat)), paste0("D", 1:3)))
dat <- cbind(dat, setNames(data.frame(matrix(round(rnorm(60), 2), 20, 3)), paste0("W", 1:3)))
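Because the question defines the index as I = W1*Dum1 + ... + Wn*DumN with the dummy coefficients as weights, it can also be computed directly from a fitted model as a matrix product. The following is a hedged sketch on simulated data: the variable names, the simulated effects, and the choice of D1 as the reference category (dropped to avoid collinearity with the intercept) are all assumptions made for illustration.

```r
# Simulated data: one continuous regressor and three mutually exclusive dummies
set.seed(1)
n   <- 48
grp <- factor(rep(1:3, length.out = n))
dat <- data.frame(X1 = rnorm(n), model.matrix(~ 0 + grp))
names(dat)[2:4] <- paste0("D", 1:3)
dat$Y <- 1 + 0.5 * dat$X1 + 2 * dat$D2 - 1 * dat$D3 + rnorm(n)

# D1 is the reference category, so only D2 and D3 enter the model
fit <- lm(Y ~ X1 + D2 + D3, data = dat)

# Weights = estimated dummy coefficients; index = weighted sum of the dummies
dummies <- c("D2", "D3")
w       <- coef(fit)[dummies]
dat$I   <- as.vector(as.matrix(dat[dummies]) %*% w)
head(dat)
```

With mutually exclusive dummies the index simply equals the coefficient of whichever dummy is active, and 0 for the reference category.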
Creating new variable in wide data format, R
I have transformed my data into a wide format using the mlogit.data function in order to be able to perform an mlogit multinomial logit regression in R. The data has three different "choices" and looks like this (in its wide format):
Observation  Choice  Variable A  Variable B  Variable C
1             1      1.27        0.2         0.81
1             0      1.27        0.2         0.81
1            -1      1.27        0.2         0.81
2             1      0.20        0.45        0.70
2             0      0.20        0.45        0.70
2            -1      0.20        0.45        0.70
However, as the variables A, B and C are linked to the different outcomes I would now like to create a new variable that looks like this:
Observation  Choice  Variable A  Variable B  Variable C  Variable D
1             1      1.27        0.2         0.81        1.27
1             0      1.27        0.2         0.81        0.2
1            -1      1.27        0.2         0.81        0.81
2             1      0.20        0.45        0.70        0.20
2             0      0.20        0.45        0.70        0.45
2            -1      0.20        0.45        0.70        0.70
I have tried the following code:
Variable D <- ifelse(Choice == "1", Variable A, ifelse(Choice == "-1", Variable B, Variable C))
However, the ifelse function only considers one choice from each observation, creating this:
Observation  Choice  Variable A  Variable B  Variable C  Variable D
1             1      1.27        0.2         0.81        1.27
1             0      1.27        0.2         0.81        -
1            -1      1.27        0.2         0.81        -
2             1      0.20        0.45        0.70        -
2             0      0.20        0.45        0.70        0.2
2            -1      0.20        0.45        0.70        -
Anyone know how to solve this? Thanks!
You can create a table mapping choices to variables and then use match:
choice_map <-
  data.frame(choice = c(1, 0, -1), var = grep('Variable[A-C]', names(df)))
#   choice var
# 1      1   3
# 2      0   4
# 3     -1   5
df$VariableD <-
  df[cbind(seq_len(nrow(df)), with(choice_map, var[match(df$Choice, choice)]))]
df
#   Observation Choice VariableA VariableB VariableC VariableD
# 1           1      1      1.27      0.20      0.81      1.27
# 2           1      0      1.27      0.20      0.81      0.20
# 3           1     -1      1.27      0.20      0.81      0.81
# 4           2      1      0.20      0.45      0.70      0.20
# 5           2      0      0.20      0.45      0.70      0.45
# 6           2     -1      0.20      0.45      0.70      0.70
Data used (removed spaces in colnames):
df <- data.table::fread('
Observation Choice VariableA VariableB VariableC
1  1 1.27 0.2  0.81
1  0 1.27 0.2  0.81
1 -1 1.27 0.2  0.81
2  1 0.20 0.45 0.70
2  0 0.20 0.45 0.70
2 -1 0.20 0.45 0.70
', data.table = F)
df$`Variable D`= sapply(1:nrow(df),function(x){
  df[x,4-df$Choice[x]]
})
> df
  Observation Choice Variable A Variable B Variable C Variable D
1           1      1       1.27       0.20       0.81       1.27
2           1      0       1.27       0.20       0.81       0.20
3           1     -1       1.27       0.20       0.81       0.81
4           2      1       0.20       0.45       0.70       0.20
5           2      0       0.20       0.45       0.70       0.45
6           2     -1       0.20       0.45       0.70       0.70
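For only three choices, a nested ifelse also works, provided each choice is mapped to the column shown in the desired output (1 to A, 0 to B, -1 to C) and the columns are referenced through the data frame. A minimal sketch, assuming column names without spaces (VariableA and so on) and rebuilding the six example rows by hand:

```r
# Example rows from the question, with spaces removed from the column names
df <- data.frame(
  Observation = rep(1:2, each = 3),
  Choice      = rep(c(1, 0, -1), times = 2),
  VariableA   = rep(c(1.27, 0.20), each = 3),
  VariableB   = rep(c(0.20, 0.45), each = 3),
  VariableC   = rep(c(0.81, 0.70), each = 3)
)

# ifelse is vectorised, so every row is evaluated against its own Choice
df$VariableD <- with(df, ifelse(Choice == 1, VariableA,
                         ifelse(Choice == 0, VariableB, VariableC)))
df
```

The matrix-indexing approach scales better when there are many choices; the nested form is mainly easier to read for two or three.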
Cumulative sum based on factor on R
I have the following dataset, and I need to accumulate the values of Variable.1 while the factor is 0, and then write the accumulated sum on the row where the factor becomes != 0. I've tried the loop below, but it didn't work at all.
for(i in dataset$Variable.1) {
  ifelse(dataset$Factor == 0,
         dataset$teste <- dataset$Variable.1 + i,
         dataset$teste <- dataset$Variable.1)
  i<- dataset$Variable.1
  print(i)
}
Any ideas? Below is an example of the dataset. I wish to get the "Result" column. In the real data, I also have a negative factor (-1).
         Date Factor Variable.1 Result
1  03/02/2018      0       0.75   0.75
2  04/02/2018      0       0.75   1.50
3  05/02/2018      1       0.96   2.46
4  06/02/2018      1       0.76   0.76
5  07/02/2018      0       1.35   1.35
6  08/02/2018      1       0.70   2.05
7  09/02/2018      1       2.02   2.02
8  10/02/2018      0       0.00   0.00
9  11/02/2018      0       0.00   0.00
10 12/02/2018      0       0.20   0.20
11 13/02/2018      0       0.13   0.33
12 14/02/2018      0       1.64   1.97
13 15/02/2018      0       0.03   2.00
14 16/02/2018      1       0.51   2.51
15 17/02/2018      1       0.00   0.00
16 18/02/2018      0       0.00   0.00
17 19/02/2018      0       0.83   0.83
18 20/02/2018      1       0.42   1.25
19 21/02/2018      1       0.17   0.17
20 22/02/2018      1       0.97   0.97
21 23/02/2018      0       0.92   0.92
22 24/02/2018      0       0.00   0.92
23 25/02/2018      0       0.00   0.92
24 26/02/2018      1       0.19   1.11
25 27/02/2018      1       0.87   0.87
26 28/02/2018      1       0.85   0.85
27 01/03/2018      1       1.95   1.95
28 02/03/2018      1       0.54   0.54
29 03/03/2018      1       0.00   0.00
30 04/03/2018      0       0.00   0.00
31 05/03/2018      0       1.17   1.17
32 06/03/2018      1       0.25   1.42
33 07/03/2018      1       1.45   1.45
Thanks in advance.
If you want to stick with the for-loop, you can try this code:
DF$Result <- NA
prev <- 0
for(i in seq_len(nrow(DF))){
  DF$Result[i] <- DF$Variable.1[i] + prev
  if(DF$Factor[i] == 1)
    prev <- 0
  else
    prev <- DF$Result[i]
}
Iteratively, try something like:
a=as.data.frame(cbind(Factor=c(0,0,1,1,0,1,1, rep(0,3),1),Variable.1=c(0.75,0.75,0.96,0.71,1.35,0.7, 0.75,0.96,0.71,1.35,0.7)))
Result=0
aux=NULL
for (i in 1:nrow(a)){
  if (a$Factor[i]==0){
    Result=Result+a$Variable.1[i]
    aux=c(aux,Result)
  } else{
    Result=Result+a$Variable.1[i]
    aux=c(aux,Result)
    Result=0
  }
}
a$Results=aux
a
   Factor Variable.1 Results
1       0       0.75    0.75
2       0       0.75    1.50
3       1       0.96    2.46
4       1       0.71    0.71
5       0       1.35    1.35
6       1       0.70    2.05
7       1       0.75    0.75
8       0       0.96    0.96
9       0       0.71    1.67
10      0       1.35    3.02
11      1       0.70    3.72
A possibility using tidyverse and data.table:
library(dplyr)       # mutate, lag, group_by, select
library(data.table)  # rleid
df %>%
  mutate(temp = ifelse(Factor == 1 & lag(Factor) == 1, NA, 1), #Marking the rows after the first 1 in "Factor" as NA
         temp = ifelse(!is.na(temp), rleid(temp), NA)) %>% #Run length along non-NA values
  group_by(temp) %>% #Grouping by run length
  mutate(Result = ifelse(!is.na(temp), cumsum(Variable.1), Variable.1)) %>% #Cumulative sum of desired rows
  ungroup() %>%
  select(-temp) #Removing the redundant variable
   Date       Factor Variable.1 Result
   <chr>       <int>      <dbl>  <dbl>
 1 03/02/2018      0      0.750  0.750
 2 04/02/2018      0      0.750  1.50
 3 05/02/2018      1      0.960  2.46
 4 06/02/2018      1      0.760  0.760
 5 07/02/2018      0      1.35   1.35
 6 08/02/2018      1      0.700  2.05
 7 09/02/2018      1      2.02   2.02
 8 10/02/2018      0      0.     0.
 9 11/02/2018      0      0.     0.
10 12/02/2018      0      0.200  0.200
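The same reset-after-nonzero logic can be written without a loop or extra packages: a new group starts on the row after any non-zero Factor, and the cumulative sum is taken within each group. This is a sketch on the first seven rows of the example; treating every non-zero Factor (including -1) as a reset point is an assumption about the intended behaviour.

```r
# First seven rows of the example data
dataset <- data.frame(
  Factor     = c(0, 0, 1, 1, 0, 1, 1),
  Variable.1 = c(0.75, 0.75, 0.96, 0.76, 1.35, 0.70, 2.02)
)

# Group id increases on the row AFTER each non-zero Factor
grp <- cumsum(c(0, head(dataset$Factor, -1) != 0))

# Cumulative sum of Variable.1 within each group
dataset$Result <- ave(dataset$Variable.1, grp, FUN = cumsum)
dataset
```

For these rows grp is 0 0 0 1 2 2 3, so the running total restarts on rows 4, 5 and 7, as in the "Result" column of the question.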
How to count these transitions - in R
Given a table of values, where A = state of system, B = length of state, and C = cumulative length of states:
A    B     C
1 1.16  1.16
0 0.51  1.67
1 1.16  2.84
0 0.26  3.10
1 0.59  3.69
0 0.39  4.08
1 0.78  4.85
0 0.90  5.75
1 0.78  6.53
0 0.26  6.79
1 0.12  6.91
0 0.51  7.42
1 0.26  7.69
0 0.51  8.20
1 0.39  8.59
0 0.51  9.10
1 1.16 10.26
0 1.10 11.36
1 0.59 11.95
0 0.51 12.46
How would I use R to calculate the number of transitions (where A gives the state) per constant interval length, where the intervals are consecutive and could be any arbitrary number (I chose a value of 2 in my image example)?
For example, using the table values or the image included we count 2 transitions from 0-2, 3 transitions from greater than 2-4, 3 transitions from >4-6, etc.
This is straightforward in R. All you need is column C and ?cut. Consider:
d <- read.table(text="A B C
1 1.16 1.16
0 0.51 1.67
1 1.16 2.84
0 0.26 3.10
1 0.59 3.69
0 0.39 4.08
1 0.78 4.85
0 0.90 5.75
1 0.78 6.53
0 0.26 6.79
1 0.12 6.91
0 0.51 7.42
1 0.26 7.69
0 0.51 8.20
1 0.39 8.59
0 0.51 9.10
1 1.16 10.26
0 1.10 11.36
1 0.59 11.95
0 0.51 12.46", header=TRUE)
fi <- cut(d$C, breaks=seq(from=0, to=14, by=2))
table(fi)
# fi
#   (0,2]   (2,4]   (4,6]   (6,8]  (8,10] (10,12] (12,14]
#       2       3       3       5       3       3       1
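Since the question asks for an arbitrary interval length, the cut call can be wrapped in a small helper that derives the upper break from the data instead of hard-coding 14. A sketch; the function name count_per_interval and its arguments are hypothetical.

```r
# Count how many cumulative lengths fall into consecutive intervals of `width`
count_per_interval <- function(cum_len, width) {
  # Round the largest value up to the next multiple of `width`
  upper <- ceiling(max(cum_len) / width) * width
  table(cut(cum_len, breaks = seq(0, upper, by = width)))
}

# Column C from the question
C <- c(1.16, 1.67, 2.84, 3.10, 3.69, 4.08, 4.85, 5.75, 6.53, 6.79,
       6.91, 7.42, 7.69, 8.20, 8.59, 9.10, 10.26, 11.36, 11.95, 12.46)

counts <- count_per_interval(C, width = 2)
counts
```

The intervals are right-closed, so a value exactly on a break is counted in the lower interval, and a cumulative length of exactly 0 would fall outside all intervals.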
R :How to execute FOR loop for Kmeans
I have an input file with the format below:
RN KEY MET1 MET2 MET3 MET4
1  1   0.11 0.41 0.91 0.17
2  1   0.94 0.02 0.17 0.84
3  1   0.56 0.64 0.46 0.7
4  1   0.57 0.23 0.81 0.09
5  2   0.82 0.67 0.39 0.63
6  2   0.99 0.90 0.34 0.84
7  2   0.83 0.01 0.70 0.29
I have to execute Kmeans in R separately for the DF with Key=1, the DF with Key=2, and so on. Afterwards the final output CSV should look like:
RN KEY MET1 MET2 MET3 MET4 CLST
1  1   0.11 0.41 0.91 0.17 1
2  1   0.94 0.02 0.17 0.84 1
3  1   0.56 0.64 0.46 0.77 2
4  1   0.57 0.23 0.81 0.09 2
5  2   0.82 0.67 0.39 0.63 1
6  2   0.99 0.90 0.34 0.84 2
7  2   0.83 0.01 0.70 0.29 2
I.e. Key=1 is to be treated as a separate DF, Key=2 is to be treated as a separate DF, and so on. Finally, the clustering output of each DF is to be combined with the Key column first (since Key cannot participate in clustering) and then the different DFs are combined into the final output.
In the above example, DF1 is:
KEY MET1 MET2 MET3 MET4
1   0.11 0.41 0.91 0.17
1   0.94 0.02 0.17 0.84
1   0.56 0.64 0.46 0.77
1   0.57 0.23 0.81 0.09
DF2 is:
KEY MET1 MET2 MET3 MET4
2   0.82 0.67 0.39 0.63
2   0.99 0.90 0.34 0.84
2   0.83 0.01 0.70 0.29
Please suggest how to achieve this in R.
Pseudo code:
n<-Length(unique(Mydf$key))
for i=1 to n
{
  #fetch partial df for each value of Key and run K means
  dummydf<-subset(mydf,mydf$key=i
  KmeansIns<-Kmeans(dummydf,2)
  # combine with cluster result
  dummydf<-data.frame(dummydf,KmeansIns$cluster)
  # combine each smalldf into final Global DF
  finaldf<-data.frame(finaldf,dummydf)
}Next i
#Now we have finaldf then it can be written to file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  # your clustering code here
})
where x is a subset of your DF for each KEY.
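One way the by skeleton could be filled in is sketched below. It assumes two clusters per key and clustering on the four MET columns only; the names met_cols, clustered and finaldf and the output file name are hypothetical. Cluster labels from kmeans are arbitrary, so the numbering may differ from the desired output even when the grouping is the same.

```r
# Example data from the question
DF <- data.frame(
  RN   = 1:7,
  KEY  = c(1, 1, 1, 1, 2, 2, 2),
  MET1 = c(0.11, 0.94, 0.56, 0.57, 0.82, 0.99, 0.83),
  MET2 = c(0.41, 0.02, 0.64, 0.23, 0.67, 0.90, 0.01),
  MET3 = c(0.91, 0.17, 0.46, 0.81, 0.39, 0.34, 0.70),
  MET4 = c(0.17, 0.84, 0.70, 0.09, 0.63, 0.84, 0.29)
)
met_cols <- paste0("MET", 1:4)

set.seed(1)  # kmeans starts from random centers
clustered <- by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  # Cluster on the metric columns only; KEY and RN do not take part
  x$CLST <- kmeans(x[, met_cols], centers = 2)$cluster
  x
})

# Stack the per-key data frames back into one
finaldf <- do.call(rbind, clustered)
finaldf
# write.csv(finaldf, "clustered.csv", row.names = FALSE)  # hypothetical file name
```

Each key needs at least as many distinct rows as centers, otherwise kmeans stops with an error.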
A solution using data.tables.
library(data.table)
setDT(DF)[,CLST:=kmeans(.SD, centers=2)$cluster, by=KEY, .SDcols=3:6]
DF
#    RN KEY MET1 MET2 MET3 MET4 CLST
# 1:  1   1 0.11 0.41 0.91 0.17    2
# 2:  2   1 0.94 0.02 0.17 0.84    1
# 3:  3   1 0.56 0.64 0.46 0.70    1
# 4:  4   1 0.57 0.23 0.81 0.09    2
# 5:  5   2 0.82 0.67 0.39 0.63    2
# 6:  6   2 0.99 0.90 0.34 0.84    2
# 7:  7   2 0.83 0.01 0.70 0.29    1
#Read data
mdf <- read.table("mydat.txt", header=T)
#Convert to list based on KEY column
mls <- split(mdf, f=mdf$KEY)
#Define columns to use in clustering
myv <- paste0("MET", 1:4)
#Cluster each df item in list : modify kmeans() args as appropriate
kls <- lapply(X=mls, FUN=function(x){x$clust <- kmeans(x[, myv], centers=2)$cluster ; return(x)})
#Make final "global" df
finaldf <- do.call(rbind, kls)