Add columns from different data frames and stack on two indicators - r

We’d like to merge some columns from a data frame with the matching columns from various different data frames. Our main data frame predict looks as follows:
>predict
x1 x2 x3
1 1 1
0 1 0
1 1 0
1 1 0
0 0 1
(There may be more columns depending on the quantity of prediction runs)
Our goal is to merge this data frame with the y-columns from three different test data frames (df_1, df_2 and df_3), which all have the same structure. The needed columns are accessed through df_1$y[test] (test is a logical vector that identifies the 5 values matching our x-values) and have the same structure as the x-columns from predict.
The desired output would look like this:
>predict_test
x1 x2 x3 y1 y2 y3
1 1 1 1 1 1
0 1 0 0 0 0
1 1 0 0 1 0
1 1 0 1 1 1
0 0 1 0 0 1
In the next step we need to stack the x-columns into one column and the y-columns into another in order to do evaluations. It is important to stack them in the correct order, i.e. x2 under x1 and x3 under x2, and likewise for the y-columns.
>predict_test_stack
x_all y_all
1 1
0 0
1 0
1 1
0 0
1 1
1 0
1 1
1 1
0 0
1 1
0 0
0 0
0 1
1 1
This probably works with melt, but we don't know how to apply it while indicating two different id variables.
Thanks for your help.

data
df1 <- read.table(text = "x1 x2 x3
1 1 1
0 1 0
1 1 0
1 1 0
0 0 1",stringsAsFactors = FALSE,header=TRUE)
df2 <- read.table(text = "y1 y2 y3
1 1 1
0 0 0
0 1 0
1 1 1
0 0 1",stringsAsFactors = FALSE,header=TRUE)
solution
We bind the data frames side by side, then unlist the result and reshape it into a matrix with one column per source data frame. Finally, we set the names by taking the first column name of each data frame and stripping its digit.
list1 <- list(df1,df2)
side_by_side <- data.frame(list1)
# x1 x2 x3 y1 y2 y3
# 1 1 1 1 1 1 1
# 2 0 1 0 0 0 0
# 3 1 1 0 0 1 0
# 4 1 1 0 1 1 1
# 5 0 0 1 0 0 1
output <- data.frame(matrix(unlist(side_by_side),ncol = length(list1)))
names(output) <- sapply(list1,function(x){sub("[[:digit:]]","",names(x)[1])})
# x y
# 1 1 1
# 2 0 0
# 3 1 0
# 4 1 1
# 5 0 0
# 6 1 1
# 7 1 0
# 8 1 1
# 9 1 1
# 10 0 0
# 11 1 1
# 12 0 0
# 13 0 0
# 14 0 1
# 15 1 1
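The question starts from df_1$y[test] rather than from a ready-made df2, so here is a minimal sketch of both steps under that setup. The contents of df_1, df_2, df_3 and test are made up for illustration (chosen so the result matches the desired output in the question); only the object names come from the question.

```r
# Hypothetical stand-ins for the question's objects (values are invented).
predict <- data.frame(x1 = c(1, 0, 1, 1, 0),
                      x2 = c(1, 1, 1, 1, 0),
                      x3 = c(1, 0, 0, 0, 1))
test <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)  # selects 5 of 7 rows
df_1 <- data.frame(y = c(1, 9, 0, 0, 9, 1, 0))
df_2 <- data.frame(y = c(1, 9, 0, 1, 9, 1, 0))
df_3 <- data.frame(y = c(1, 9, 0, 0, 9, 1, 1))

# Step 1: pull the matching y values out of each test data frame
# and bind them next to the x columns.
y_cols <- lapply(list(df_1, df_2, df_3), function(d) d$y[test])
names(y_cols) <- paste0("y", seq_along(y_cols))
predict_test <- cbind(predict, as.data.frame(y_cols))

# Step 2: stack column by column (x1, then x2, then x3; same for y).
# unlist() on a data frame concatenates its columns in order.
x_names <- grep("^x", names(predict_test), value = TRUE)
y_names <- grep("^y", names(predict_test), value = TRUE)
predict_test_stack <- data.frame(
  x_all = unlist(predict_test[x_names], use.names = FALSE),
  y_all = unlist(predict_test[y_names], use.names = FALSE)
)
```

Selecting the columns by name pattern keeps this working when more prediction runs add further x and y columns, as long as there are equally many of each.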

Related

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each new variable is either 1 or 0:
jb$Finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to add up the number of 1s in each row across a selection of the columns I just made (using the code above) and store the result in another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result was 14 for every row. I was expecting something no larger than 9, since I only selected 9 variables. I don't want to delete the other variables like V1 and V2, I just want to focus on summing some variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add up the number of 1s in each row (within the columns I select). How would I go about summing the 1s across a chosen set of variables/columns in code?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)
data <- matrix(rnorm(100,100,30),nrow = 10)
# recode to binary
data <- apply(data,2,function(x){x <- ifelse(x > 100,1,0)})
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V",1:5),paste0("X",1:5))
as.data.frame(data) %>%
mutate(total = select(.,starts_with("V")) %>% rowSums)
...and the output, where total equals the row sum of V1 to V5 and ignores X1 to X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
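To tie this back to the question: sum() over whole columns returns a single grand total, which mutate() then recycles into every row, which is why every row showed the same number. A minimal base R sketch of the difference follows; the data values are made up and only five of the question's columns are used.

```r
# Made-up data in the shape described in the question.
jb <- data.frame(V1            = c(1, 2, 2),
                 Finances      = c(1, 0, 0),
                 Exercise      = c(1, 1, 0),
                 Volunteer     = c(1, 0, 0),
                 Relationships = c(1, 0, 0),
                 Laugh         = c(0, 0, 1))

content_cols <- c("Finances", "Exercise", "Volunteer", "Relationships", "Laugh")

# sum() collapses all selected columns into ONE number for the whole table.
grand_total <- sum(jb$Finances, jb$Exercise, jb$Volunteer,
                   jb$Relationships, jb$Laugh)

# rowSums() returns one total per row, using only the selected columns.
# V1 stays in the data frame but does not enter the sum.
jb$sum.content <- rowSums(jb[content_cols])
```

Here grand_total is a single value, while jb$sum.content holds one count per row.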

Change values in multiple columns with unique value and merge into single column (R)

Let's say I have a dataset (ds) with 4 rows with 3 variables as seen below:
ds
x1 x2 x3
1 0 0
0 0 1
0 1 0
0 0 1
How do I change the "1" to a unique value for each column and combine them into a single column?
So, the first step:
x1 x2 x3
1 0 0
0 0 3
0 2 0
0 0 3
Then, the second step (creating x4):
x1 x2 x3 x4
1 0 0 1
0 0 3 3
0 2 0 2
0 0 3 3
I have a lot more variables than this, I just want to know how to minimize the number of lines I write so it's not like 10+ lines.
You could do this:
df <- read.table(text="x1 x2 x3
1 0 0
0 0 1
0 1 0
0 0 1", header=TRUE, stringsAsFactors=FALSE)
df <- df*col(df)
df$x4 <- rowSums(df)
x1 x2 x3 x4
1 1 0 0 1
2 0 0 3 3
3 0 2 0 2
4 0 0 3 3
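To see why this works: col(df) returns a matrix holding the column index of every cell, so multiplying by it turns each 1 into its column number. If only the combined column is needed, a matrix product gives it directly. A small sketch, using a hypothetical copy of the data named ds as in the question:

```r
ds <- data.frame(x1 = c(1, 0, 0, 0),
                 x2 = c(0, 0, 1, 0),
                 x3 = c(0, 1, 0, 1))

# col(ds) has the same shape as ds; every entry is its column number.
idx <- col(ds)

# Two-step version from the answer: recode, then sum across each row.
recoded <- ds * idx
recoded$x4 <- rowSums(recoded)

# One-step version: weight each column by its position and add up.
x4_direct <- as.vector(as.matrix(ds) %*% seq_len(ncol(ds)))
```

Both versions assume each row contains at most one 1. With several 1s in a row, the sum would no longer identify a single column.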

selecting rows according to all covariates combinations of a different dataframe

I am currently trying to figure out how to select all the rows of a long dataframe (long) that present the same x1 and x2 combinations characterizing another dataframe (short).
The simplified data are:
long <- read.table(text = "
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
3 0 0
3 0 1
3 1 0
3 1 1
4 0 0
4 0 1
4 1 0
4 1 1",
header=TRUE)
and
short <- read.table(text = "
x1 x2
0 0
0 1",
header=TRUE)
The expected output would be:
id_type x1 x2
1 0 0
1 0 1
2 0 0
2 0 1
3 0 0
3 0 1
4 0 0
4 0 1
I have tried to use:
out <- long[unique(long[,c("x1", "x2")]) %in% unique(short[,c("x1", "x2")]), ]
but %in% is used incorrectly here. Thank you very much for any help!
You are asking for an inner join. By default, merge joins on the columns the two data frames have in common (here x1 and x2):
> merge(long, short)
x1 x2 id_type
1 0 0 1
2 0 0 2
3 0 0 3
4 0 0 4
5 0 1 1
6 0 1 2
7 0 1 3
8 0 1 4
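Note that merge puts the key columns first and sorts by them, so the result does not have the layout of the expected output. A minimal sketch that names the keys explicitly and then restores the original column and row order (the data is rebuilt here with expand.grid purely for brevity):

```r
# Same content as `long` and `short` in the question.
long <- expand.grid(x2 = 0:1, x1 = 0:1, id_type = 1:4)[, c("id_type", "x1", "x2")]
short <- data.frame(x1 = c(0L, 0L), x2 = 0:1)

# Inner join on the two covariates.
out <- merge(long, short, by = c("x1", "x2"))

# Restore the column order of `long` and sort rows by id_type, x1, x2.
out <- out[order(out$id_type, out$x1, out$x2), names(long)]
rownames(out) <- NULL
```

Spelling out by = c("x1", "x2") also protects against accidental joins on other columns that happen to share a name.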

Selecting rows of a dataframe according to the correspondence of two covariates' levels

I am currently working on two different dataframes, one of which is extremely long (long). What I need to do is to select all the rows of long whose corresponding id_type appears at least once in the other (smaller) dataset.
Suppose the two dataframes are:
long <- read.table(text = "
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
3 0 0
3 0 1
3 1 0
3 1 1
4 0 0
4 0 1
4 1 0
4 1 1",
header=TRUE)
and
short <- read.table(text = "
id_type y1 y2
1 5 6
1 5 5
2 7 9",
header=TRUE)
In practice, what I am trying to obtain is:
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
I have tried to use out <- long[long[,"id_type"]==short[,"id_type"], ], but it is clearly wrong. How would you proceed? Thanks
Just use %in%:
out <- long[long$id_type %in% short$id_type, ]
Look at ?"%in%".
You were missing %in%:
> long[long$id_type %in% unique(short$id_type),]
id_type x1 x2
1 1 0 0
2 1 0 1
3 1 1 0
4 1 1 1
5 2 0 0
6 2 0 1
7 2 1 0
8 2 1 1
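As for why the == attempt fails: == compares the two vectors element by element and recycles the shorter one, so it checks positions rather than membership. A short sketch with just the two id_type columns (rebuilt by hand here) shows the difference:

```r
long_ids  <- rep(1:4, each = 4)  # id_type column of `long`
short_ids <- c(1, 1, 2)          # id_type column of `short`

# == pairs up elements by position and recycles short_ids; R also warns
# because 16 is not a multiple of 3 (suppressed here to keep output clean).
by_position <- suppressWarnings(long_ids == short_ids)

# %in% asks, for each element on the left, whether it occurs
# anywhere on the right.
by_membership <- long_ids %in% short_ids

which(by_membership)  # rows of `long` that would be kept
```

Wrapping short$id_type in unique() is optional, since %in% gives the same answer with or without duplicates on the right-hand side.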

How to exclude cases that do not repeat X times in R?

I have unbalanced longitudinal data in long format. I would like to exclude all the cases that do not contain complete information, by which I mean all cases that do not repeat 8 times. Can someone help me find a solution?
Below is an example: I have three subjects (A, B, and C). I have 8 observations for A and B, but only 2 for C. How can I delete the rows for C, given that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
Filter(function(subgroup) nrow(subgroup) == 8,
split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
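A variation on the ave approach that keeps the counts numeric and makes the required number of repeats a parameter, as a minimal sketch. The data frame below is a reduced stand-in for temp (one made-up value column), and the names required and n_per_case are introduced here for illustration.

```r
# Reduced stand-in for `temp`: 8 rows each for A and B, 2 rows for C.
temp <- data.frame(V1 = rep(c("A", "B", "C"), times = c(8, 8, 2)),
                   V2 = 1)

required <- 8  # number of repeats a case needs in order to be kept

# For every row, count how many rows share its value of V1.
# Passing an integer vector to ave() keeps the result numeric.
n_per_case <- ave(seq_along(temp$V1), temp$V1, FUN = length)

complete <- temp[n_per_case == required, ]
```

Unlike the rle version, this does not depend on the rows being ordered by subject.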
