Creating new variable in wide data format, R

I have transformed my data into a wide format using the mlogit.data function in order to be able to perform an mlogit multinomial logit regression in R. The data has three different "choices" and looks like this (in its wide format):
Observation  Choice  Variable A  Variable B  Variable C
1             1      1.27        0.2         0.81
1             0      1.27        0.2         0.81
1            -1      1.27        0.2         0.81
2             1      0.20        0.45        0.70
2             0      0.20        0.45        0.70
2            -1      0.20        0.45        0.70
However, as the variables A, B and C are linked to the different outcomes, I would now like to create a new variable that looks like this:
Observation  Choice  Variable A  Variable B  Variable C  Variable D
1             1      1.27        0.2         0.81        1.27
1             0      1.27        0.2         0.81        0.2
1            -1      1.27        0.2         0.81        0.81
2             1      0.20        0.45        0.70        0.20
2             0      0.20        0.45        0.70        0.45
2            -1      0.20        0.45        0.70        0.70
I have tried the following code:
`Variable D` <- ifelse(Choice == "1", `Variable A`, ifelse(Choice == "-1", `Variable B`, `Variable C`))
However, the ifelse function only considers one choice from each observation, creating this:
Observation  Choice  Variable A  Variable B  Variable C  Variable D
1             1      1.27        0.2         0.81        1.27
1             0      1.27        0.2         0.81        -
1            -1      1.27        0.2         0.81        -
2             1      0.20        0.45        0.70        -
2             0      0.20        0.45        0.70        0.2
2            -1      0.20        0.45        0.70        -
Anyone know how to solve this?
Thanks!

You can create a table mapping choices to variable columns and then use match:
choice_map <- data.frame(choice = c(1, 0, -1),
                         var = grep('Variable[A-C]', names(df)))
choice_map
#   choice var
# 1      1   3
# 2      0   4
# 3     -1   5

# Matrix indexing: each (row, column) pair picks the matching cell
df$VariableD <-
  df[cbind(seq_len(nrow(df)), with(choice_map, var[match(df$Choice, choice)]))]
df
#   Observation Choice VariableA VariableB VariableC VariableD
# 1           1      1      1.27      0.20      0.81      1.27
# 2           1      0      1.27      0.20      0.81      0.20
# 3           1     -1      1.27      0.20      0.81      0.81
# 4           2      1      0.20      0.45      0.70      0.20
# 5           2      0      0.20      0.45      0.70      0.45
# 6           2     -1      0.20      0.45      0.70      0.70
Data used (spaces removed from the column names):
df <- data.table::fread('
Observation Choice VariableA VariableB VariableC
1 1 1.27 0.2 0.81
1 0 1.27 0.2 0.81
1 -1 1.27 0.2 0.81
2 1 0.20 0.45 0.70
2 0 0.20 0.45 0.70
2 -1 0.20 0.45 0.70
', data.table = F)
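As a side note, a plain vectorized ifelse over whole columns also works here once the choice-to-column mapping matches the desired output (1 to VariableA, 0 to VariableB, -1 to VariableC); a minimal sketch using the renamed columns:
df$VariableD <- with(df, ifelse(Choice == 1, VariableA,
                                ifelse(Choice == 0, VariableB, VariableC)))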

Alternatively, since choices 1, 0 and -1 line up with columns 3, 4 and 5, the column index of the matching variable is simply 4 - Choice:
df$`Variable D` <- sapply(1:nrow(df), function(x) {
  df[x, 4 - df$Choice[x]]  # Choice 1 -> col 3, 0 -> col 4, -1 -> col 5
})
> df
  Observation Choice Variable A Variable B Variable C Variable D
1           1      1       1.27       0.20       0.81       1.27
2           1      0       1.27       0.20       0.81       0.20
3           1     -1       1.27       0.20       0.81       0.81
4           2      1       0.20       0.45       0.70       0.20
5           2      0       0.20       0.45       0.70       0.45
6           2     -1       0.20       0.45       0.70       0.70

Related

R: need help matching up table rows and getting differences

I have chromatographic data in a table organized by peak position and integration value of various samples. Every sample in the table also has a repeated measurement, logged under a different sample log number.
What I'm interested in is the repeatability of the measurements of the various peaks. The measure for that would be the difference in peak integration for each sample, which should ideally be 0.
The data
Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A       100  104  0.20  0.80  0.30  0.00  0.00
B       101  106  0.25  0.73  0.29  0.01  0.04
C       102  103  0.20  0.80  0.30  0.00  0.07
C       103  102  0.22  0.81  0.31  0.04  0.00
A       104  100  0.21  0.70  0.33  0.00  0.10
B       106  101  0.20  0.73  0.37  0.00  0.03
where Log1 is the original sample log number and Log2 is the repeat log number.
How can I construct a new variable for every peak (being the difference PeakX_Log1 - PeakX_Log2)?
Note that in my example I only have 5 peaks; the real-life situation is a complex mixture involving more than 20 peaks, so it is very hard to do by hand.
If you will only have two values for each sample, something like this could work:
df <- data.table::fread(
"Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03"
)
library(tidyverse)
new_df <- df %>%
  mutate(Log = ifelse(Log1 < Log2, "Log1", "Log2")) %>%
  select(-Log1, -Log2) %>%
  pivot_longer(cols = starts_with("Peak"), names_to = "Peak") %>%
  pivot_wider(values_from = value, names_from = Log) %>%
  mutate(Variation = Log1 - Log2)
new_df
# A tibble: 15 × 5
   Sample Peak   Log1  Log2 Variation
   <chr>  <chr> <dbl> <dbl>     <dbl>
 1 A      Peak1  0.2   0.21   -0.0100
 2 A      Peak2  0.8   0.7     0.100
 3 A      Peak3  0.3   0.33   -0.0300
 4 A      Peak4  0     0       0
 5 A      Peak5  0     0.1    -0.1
 6 B      Peak1  0.25  0.2     0.05
 7 B      Peak2  0.73  0.73    0
 8 B      Peak3  0.29  0.37   -0.08
 9 B      Peak4  0.01  0       0.01
10 B      Peak5  0.04  0.03    0.01
11 C      Peak1  0.2   0.22   -0.0200
12 C      Peak2  0.8   0.81   -0.0100
13 C      Peak3  0.3   0.31   -0.0100
14 C      Peak4  0     0.04   -0.04
15 C      Peak5  0.07  0       0.07
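The same result can also be computed in base R; a minimal sketch, assuming each sample appears exactly twice and that the row with the smaller Log1 holds the original measurement:
df2 <- as.data.frame(df)                   # fread returns a data.table
df2 <- df2[order(df2$Sample, df2$Log1), ]  # original row first within each sample
peaks <- grep("^Peak", names(df2))
variation <- df2[!duplicated(df2$Sample), peaks] - df2[duplicated(df2$Sample), peaks]
rownames(variation) <- unique(df2$Sample)
This keeps the peaks in wide form, one row of differences per sample.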

Add values of multiple dataframes together cell by cell

I am trying to add multiple dataframes together, but not in a bind fashion.
Is there an easy way to overlay dataframes on top of each other and add them cell by cell?
The number of columns will always be the same; the row counts will differ.
I want to sum the cells by row position, so Result[1,1] = Table1[1,1] + Table2[1,1] and so on: whatever cells have data are added, and the resulting table has the dimensions of the biggest input table.
The tables are generated dynamically, so I'd like to refrain from any hardcoding.
Consider the following two data frames:
library(dplyr)  # for the %>% pipe
table1 <- replicate(4, round(runif(10, 0, 1), 2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table2 <- replicate(4, round(runif(6, 0, 1), 2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table1
A B C D
1 0.81 0.08 0.85 0.89
2 0.88 0.82 0.62 0.77
3 0.12 0.13 0.99 0.02
4 0.17 0.54 0.37 0.62
5 0.77 0.10 0.81 0.34
6 0.58 0.15 0.00 0.56
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
table2
A B C D
1 0.95 0.81 0.99 0.92
2 0.18 0.99 0.35 0.09
3 0.73 0.10 0.02 0.68
4 0.37 0.53 0.78 0.02
5 0.48 0.54 0.79 0.83
6 0.75 0.32 0.41 0.04
We might create a new variable called ID from their row numbers and use that to sum the values after binding the rows:
library(dplyr)
library(tibble)
bind_rows(table1 %>% rowid_to_column("ID"),
          table2 %>% rowid_to_column("ID")) %>%
  group_by(ID) %>%
  summarise(across(everything(), sum))
# A tibble: 10 x 5
      ID     A     B     C     D
   <int> <dbl> <dbl> <dbl> <dbl>
 1     1  1.76  0.89  1.84  1.81
 2     2  1.06  1.81  0.97  0.86
 3     3  0.85  0.23  1.01  0.7
 4     4  0.54  1.07  1.15  0.64
 5     5  1.25  0.64  1.6   1.17
 6     6  1.33  0.47  0.41  0.6
 7     7  0.61  0.15  0.59  0.15
 8     8  0.52  0.36  0.12  0.99
 9     9  0.83  0.93  0.290 0.3
10    10  0.52  0.02  0.48  0.46
A potentially more dangerous base R approach (it overwrites table1 in place) is to subset table1 to the dimensions of table2 and add them together:
table1[seq(1, nrow(table2)), seq(1, ncol(table2))] <-
  table1[seq(1, nrow(table2)), seq(1, ncol(table2))] + table2
table1
A B C D
1 1.76 0.89 1.84 1.81
2 1.06 1.81 0.97 0.86
3 0.85 0.23 1.01 0.70
4 0.54 1.07 1.15 0.64
5 1.25 0.64 1.60 1.17
6 1.33 0.47 0.41 0.60
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
# Create your data frames
df1 <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), c = c(3, 4, 5))
df2 <- data.frame(a = c(1, 2), b = c(2, 3), c = c(3, 4))
# Start the result from the bigger of the two
if (nrow(df1) > nrow(df2)) {
  df3 <- df1
} else {
  df3 <- df2
}
# For each row of the smaller data frame, add it to the larger one
for (number in 1:min(nrow(df1), nrow(df2))) {
  df3[number, ] <- df1[number, ] + df2[number, ]
}
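For more than two tables, a sketch along the same lines generalizes the idea (assuming, as stated in the question, that all tables have the same columns; the helper add_tables is hypothetical):
# Pad each table with zero rows up to the tallest one,
# then fold the list together with element-wise addition.
add_tables <- function(...) {
  tables <- list(...)
  n <- max(sapply(tables, nrow))
  pad <- function(df) {
    extra <- matrix(0, n - nrow(df), ncol(df), dimnames = list(NULL, names(df)))
    rbind(df, as.data.frame(extra))
  }
  Reduce(`+`, lapply(tables, pad))
}
add_tables(df1, df2)  # same result as the loop above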

Find position of elements of a dataframe inside other dataframe with R

I have the following dataframe (DF_A):
   PARTY_ID PROBS_3001 PROBS_3002 PROBS_3003 PROBS_3004 PROBS_3005 PROBS_3006 PROBS_3007 PROBS_3008
1:  1000000       0.03       0.58       0.01       0.42       0.69       0.98       0.55       0.96
2:  1000001       0.80       0.37       0.10       0.95       0.77       0.69       0.23       0.07
3:  1000002       0.25       0.73       0.79       0.83       0.24       0.82       0.81       0.01
4:  1000003       0.10       0.96       0.53       0.59       0.96       0.10       0.98       0.76
5:  1000004       0.36       0.87       0.76       0.03       0.95       0.40       0.53       0.89
6:  1000005       0.15       0.78       0.24       0.21       0.03       0.87       0.67       0.64
And I have this other dataframe (DF_B):
    V1   V2   V3   V4 PARTY_ID
1 0.58 0.69 0.96 0.98  1000000
2 0.69 0.77 0.80 0.95  1000001
3 0.79 0.81 0.82 0.83  1000002
4 0.76 0.96 0.96 0.98  1000003
5 0.76 0.87 0.89 0.95  1000004
6 0.64 0.67 0.78 0.87  1000005
I need to find the positions of the elements of DF_B within the corresponding rows of DF_A, to get something like this:
  PARTY_ID P1 P2 P3 P4
1  1000000  3  6  9  7
...
Currently I'm working with the match function, but it takes a lot of time (I have 400K rows). I'm doing this:
i <- 1
while (i < nrow(DF_A)) {
  position <- match(DF_B[i, ], DF_A[i, ])
  i <- i + 1
}
Although it works, it's very slow and I know it's not the best answer to my problem. Can anyone help me please?
You can merge and then Map with a by-group operation:
library(data.table)
df_a2 <- df_a[setDT(df_b), on = "PARTY_ID"]  # assumes df_a is already a data.table
df_a3 <- df_a2[, c(PARTY_ID,
                   Map(f = function(x, y) which(x == y),
                       x = list(.SD[, names(df_a), with = FALSE]),
                       y = .SD[, paste0("V", 1:4), with = FALSE])),
               by = 1:nrow(df_a2)]
setnames(df_a3, paste0("V", 1:5), c("PARTY_ID", paste0("P", 1:4)))[, nrow := NULL]
df_a3
#    PARTY_ID P1 P2 P3 P4
# 1:  1000000  3  6  9  7
# 2:  1000001  7  6  2  5
# 3:  1000002  4  8  7  5
# 4:  1000003  9  3  3  8
# 5:  1000003  9  6  6  8
# 6:  1000004  4  3  9  6
# 7:  1000005  9  8  3  7
Here is an example on 1 million rows with two columns. It takes 14 ms on my computer.
library(data.table)
# create data tables with matching ids but in different positions
x <- data.table(id = sample(1:1000000, 1000000, replace = FALSE),
                y = sample(LETTERS, 1000000, replace = TRUE))
y <- data.table(id = sample(1:1000000, 1000000, replace = FALSE),
                z = sample(LETTERS, 1000000, replace = TRUE))
# add a column to both data tables storing the row position in x and y
x$x_row_nr <- 1:nrow(x)
y$y_row_nr <- 1:nrow(y)
# set the key in both data tables using the matching column name
setkey(x, "id")
setkey(y, "id")
# merge the data tables into one
z <- merge(x, y)
# now extract, say, the position in y of the hundredth record of x
z[x_row_nr == 100, y_row_nr]
z will contain the matching row records from both datasets with their columns attached.
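Back on the original row-wise problem, a base R version of the match loop that actually keeps its results could look like this (a sketch, assuming DF_A and DF_B are plain data frames, row-aligned as in the example):
positions <- t(sapply(seq_len(nrow(DF_A)), function(i) {
  # position of each V1..V4 value within the full i-th row of DF_A
  match(unlist(DF_B[i, paste0("V", 1:4)]), unlist(DF_A[i, ]))
}))
result <- setNames(data.frame(DF_A$PARTY_ID, positions),
                   c("PARTY_ID", paste0("P", 1:4)))
It is still a per-row loop, though, so the data.table approach above should scale better to 400K rows.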

R: How to execute FOR loop for Kmeans

I have an input file with the format below:
RN KEY MET1 MET2 MET3 MET4
 1   1 0.11 0.41 0.91 0.17
 2   1 0.94 0.02 0.17 0.84
 3   1 0.56 0.64 0.46 0.70
 4   1 0.57 0.23 0.81 0.09
 5   2 0.82 0.67 0.39 0.63
 6   2 0.99 0.90 0.34 0.84
 7   2 0.83 0.01 0.70 0.29
I have to execute k-means in R separately for the rows with KEY=1, the rows with KEY=2, and so on.
Afterwards the final output CSV should look like:
RN KEY MET1 MET2 MET3 MET4 CLST
 1   1 0.11 0.41 0.91 0.17    1
 2   1 0.94 0.02 0.17 0.84    1
 3   1 0.56 0.64 0.46 0.77    2
 4   1 0.57 0.23 0.81 0.09    2
 5   2 0.82 0.67 0.39 0.63    1
 6   2 0.99 0.90 0.34 0.84    2
 7   2 0.83 0.01 0.70 0.29    2
I.e., KEY=1 is to be treated as a separate DF, KEY=2 as another separate DF, and so on.
Finally, the clustering output of each DF is to be combined first with the KEY column (since KEY cannot participate in the clustering) and then with each different DF for the final output.
In the above example:
DF1 is
KEY MET1 MET2 MET3 MET4
  1 0.11 0.41 0.91 0.17
  1 0.94 0.02 0.17 0.84
  1 0.56 0.64 0.46 0.77
  1 0.57 0.23 0.81 0.09
DF2 is
KEY MET1 MET2 MET3 MET4
  2 0.82 0.67 0.39 0.63
  2 0.99 0.90 0.34 0.84
  2 0.83 0.01 0.70 0.29
Please suggest how to achieve this in R.
Pseudocode:
n <- length(unique(mydf$key))
finaldf <- NULL
for (i in 1:n) {
  # fetch the partial df for each value of key and run k-means
  dummydf <- subset(mydf, key == i)
  KmeansIns <- kmeans(dummydf, 2)
  # combine with the cluster result
  dummydf <- data.frame(dummydf, KmeansIns$cluster)
  # append each small df to the final global DF
  finaldf <- rbind(finaldf, dummydf)
}
# Now that we have finaldf, it can be written to file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  # your clustering code here
})
where x is a subset of your DF for each KEY.
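For instance, a minimal sketch filling that in with kmeans (assuming the MET1..MET4 columns from the question and two clusters per KEY, as in the desired output):
res <- by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  x$CLST <- kmeans(x[, paste0("MET", 1:4)], centers = 2)$cluster
  x
})
finaldf <- do.call(rbind, res)  # stack the per-KEY results back together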
A solution using data.table:
library(data.table)
setDT(DF)[, CLST := kmeans(.SD, centers = 2)$cluster, by = KEY, .SDcols = 3:6]
DF
#    RN KEY MET1 MET2 MET3 MET4 CLST
# 1:  1   1 0.11 0.41 0.91 0.17    2
# 2:  2   1 0.94 0.02 0.17 0.84    1
# 3:  3   1 0.56 0.64 0.46 0.70    1
# 4:  4   1 0.57 0.23 0.81 0.09    2
# 5:  5   2 0.82 0.67 0.39 0.63    2
# 6:  6   2 0.99 0.90 0.34 0.84    2
# 7:  7   2 0.83 0.01 0.70 0.29    1
# Read data
mdf <- read.table("mydat.txt", header = TRUE)
# Convert to a list, split on the KEY column
mls <- split(mdf, f = mdf$KEY)
# Define the columns to use in the clustering
myv <- paste0("MET", 1:4)
# Cluster each df item in the list; modify kmeans() args as appropriate
kls <- lapply(X = mls, FUN = function(x) {
  x$clust <- kmeans(x[, myv], centers = 2)$cluster
  x
})
# Make the final "global" df
finaldf <- do.call(rbind, kls)
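Since the question asks for a CSV at the end, the combined result can then be written out (the file name here is just an example):
write.csv(finaldf, "clustered_output.csv", row.names = FALSE)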

get the paired sample in R language [closed]

X<-scan()
1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1
1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
Z<-scan()
-0.05 0.11 -0.01 1.08 0.68 -1.79 -0.12 -0.06 0.17 -1.35 1.55 0.60
-1.42 -1.21 0.97 0.23 0.20 0.89 0.28 0.56 1.02 -0.32 0.20 -1.35
0.53 -0.52 -0.07 -1.07 0.10 0.53 0.97 0.32 -0.07 0.98 -1.23 0.72
-0.09 0.31 1.25 0.60 1.16 -0.98 1.63 0.72 0.24 -0.02 -1.13 0.56
0.78 1.75 -0.01 -0.44 0.47 -0.21 2.06 2.19 -0.94 -0.36 1.35 -1.35
1.50 0.13 -0.20 -0.57 -0.14 -1.34 -1.17 2.04 0.21 1.47 -1.20 -0.60
0.15 -0.64 -0.71 0.24 -0.86 -1.39 -0.63 -1.25 0.40 -0.76 0.73 -0.15
0.09 0.35 -0.19 0.29 0.56 0.82 -0.28 0.63 1.35 -0.04 1.99 1.12
-1.91 0.26 -1.18 -0.10
In the vector X, 0 is the control group and 1 is the case group.
I want to match these cases and controls based on the Z vector. Actually, I want to match elements of X based on Z and get the samples from the matched data.
What should I do?
The other answers seem to think that you're looking for subsetting, but I'm assuming (based on your use of the language "case" and "controls") that you're talking about matching in a statistical sense. If so, it sounds like you want something like the functionality provided by the Matching package, like the following:
library(Matching)
out <- Match(Tr = X, X = Z)
out$mdata  # list of `Y` outcome vector (if applicable),
           # `Tr` treatment vector, and
           # `X` matrix of covariates for the matched sample
If you also have an outcome measure, you can specify that in Match and it will give you treatment effect estimates.
There are also other packages to do matching, like MatchIt, cem, and nonrandom (the last of which has apparently been removed from CRAN), depending on what particular matching procedure you're going for.
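For example, a sketch with a hypothetical outcome vector Y of the same length as X:
out2 <- Match(Y = Y, Tr = X, X = Z)
summary(out2)  # reports the estimated treatment effect on the matched sample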
I suppose you are looking for
Z[as.logical(X)] # case
and
Z[!X] # control
I suppose your question is about subsetting; here are some examples:
# Data
X<-c(1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1,0,0,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1)
Z<-c(-0.05,0.11,-0.01,1.08,0.68,-1.79,-0.12,-0.06,0.17,-1.35,1.55,0.60,-1.42,-1.21,0.97,0.23,0.20,0.89,0.28,0.56,1.02,-0.32,0.20,-1.35,0.53,-0.52,-0.07,-1.07,0.10,0.53,0.97,0.32,-0.07,0.98,-1.23,0.72,-0.09,0.31,1.25,0.60,1.16,-0.98,1.63,0.72,0.24,-0.02,-1.13,0.56,0.78,1.75,-0.01,-0.44,0.47,-0.21,2.06,2.19,-0.94,-0.36,1.35,-1.35,1.50,0.13,-0.20,-0.57,-0.14,-1.34,-1.17,2.04,0.21,1.47,-1.20,-0.60,0.15,-0.64,-0.71,0.24,-0.86,-1.39,-0.63,-1.25,0.40,-0.76,0.73,-0.15,0.09,0.35,-0.19,0.29,0.56,0.82,-0.28,0.63,1.35,-0.04,1.99,1.12,-1.91,0.26,-1.18,-0.10)
myMatrix <- cbind(X,Z)
# Subsetting
myMatrixControls <- myMatrix[myMatrix[, 1] == 0, ]
myMatrixCases    <- myMatrix[myMatrix[, 1] == 1, ]
# Example: get the sum of Z per group
sumZ_Controls <- sum(myMatrix[myMatrix[, 1] == 0, 2])
sumZ_Cases    <- sum(myMatrix[myMatrix[, 1] == 1, 2])
