I have two data frames.
First data frame:
### create first data.frame
df1 <- data.frame(matrix(ncol = 5, nrow = 8))
### vector of column names
cols <- c("identifier1", "identifier2", "time", "lightsaber_length", "blaster_length")
### assign column names to df1
colnames(df1) <- cols
### generate random data for time column
tenths <- seq(from = 0, to = 10, by = .1)
sample_time <- sample(tenths, size = 8, replace = TRUE)
### generate random data for lightsaber_length and blaster_length columns
hundredths <- seq(from = 0, to = 1, by = .01)
sample_lightsaber <- sample(hundredths, size = 8, replace = TRUE)
sample_blaster <- sample(hundredths, size = 8, replace = TRUE)
### Assign column values
df1$identifier1 <- c(1, 1, 1, 2, 2, 3, 3, 4)
df1$identifier2 <- c("hello", "hello", "hello", "there", "there", "general", "general", "kenobi")
df1$time <- sample_time
df1$lightsaber_length <- sample_lightsaber
df1$blaster_length <- sample_blaster
Second data frame:
### create second data.frame
df2 <- data.frame(matrix(ncol = 4, nrow = 8))
### vector of column names
cols <- c("study_id", "identifier1", "identifier2", "object")
### assign column names to df2
colnames(df2) <- cols
### fill the study_id column, where study_id equals row number
df2$study_id <- seq_len(nrow(df2))
### move study_id column to front (needs dplyr; a no-op here, since study_id is already first)
library(dplyr)
df2 <- df2 %>% relocate(study_id, .before = identifier1)
### assign column values
df2$identifier1 <- c(1, 1, 2, 2, 3, 3, 4, 4)
df2$identifier2 <- c("hello", "hello", "there", "there", "general", "general", "kenobi", "kenobi")
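### alternate the two objects down the rows (explicit instead of relying on recycling)
df2$object <- rep(c("lightsaber", "blaster"), length.out = nrow(df2))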
I want a third dataframe, generated from wrangling df1 and df2, that combines 'lightsaber_length' and 'blaster_length' into one 'length' column. However, I want to retain which 'object' a 'length' value corresponds to by assigning the appropriate 'study_id' in the same row.
Each 'study_id' value represents a unique combination of 'identifier1', 'identifier2', and 'object', and does away with the need to have two separate length columns.
Desired output:
As my actual data is larger and more varied, I would like a solution that is not unique to the small example I made here.
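We can reshape to 'long' format with pivot_longer and then join with df2 on the columns the two data frames share:
library(tidyr)
library(dplyr)
df1 %>%
  # "lightsaber_length" splits into object = "lightsaber" and a value column named "length"
  pivot_longer(cols = ends_with("length"), names_to = c("object", ".value"),
               names_sep = "_") %>%
  # naming the join columns avoids relying on the automatic natural join
  left_join(df2, by = c("identifier1", "identifier2", "object"))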
Output:
# A tibble: 16 x 6
   identifier1 identifier2  time object     length study_id
         <dbl> <chr>       <dbl> <chr>       <dbl>    <int>
 1           1 hello         4.5 lightsaber   0.28        1
 2           1 hello         4.5 blaster      0.92        2
 3           1 hello         1.7 lightsaber   0.42        1
 4           1 hello         1.7 blaster      0.57        2
 5           1 hello         8.5 lightsaber   0.55        1
 6           1 hello         8.5 blaster      0.88        2
 7           2 there         4.7 lightsaber   0.09        3
 8           2 there         4.7 blaster      0.37        4
 9           2 there         0.1 lightsaber   0.57        3
10           2 there         0.1 blaster      0.74        4
11           3 general       3.6 lightsaber   0.18        5
12           3 general       3.6 blaster      0.15        6
13           3 general       6.8 lightsaber   0.84        5
14           3 general       6.8 blaster      0.66        6
15           4 kenobi        1.2 lightsaber   0.76        7
16           4 kenobi        1.2 blaster      0.64        8
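Because df1 is filled with random values, the numbers above will differ from run to run. The reshape-then-join idea can be checked on a small fixed example. This is only a sketch: the data frames `measurements` and `lookup` are hypothetical stand-ins for df1 and df2, with made-up values.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of df1: one row per measurement, two length columns
measurements <- data.frame(
  identifier1 = c(1, 2),
  identifier2 = c("hello", "there"),
  time = c(4.5, 4.7),
  lightsaber_length = c(0.28, 0.09),
  blaster_length = c(0.92, 0.37)
)

# Hypothetical miniature of df2: one study_id per identifier/object combination
lookup <- data.frame(
  study_id = 1:4,
  identifier1 = c(1, 1, 2, 2),
  identifier2 = c("hello", "hello", "there", "there"),
  object = c("lightsaber", "blaster", "lightsaber", "blaster")
)

long <- measurements %>%
  # split "<object>_length" into an object column and a single length column
  pivot_longer(cols = ends_with("_length"),
               names_to = c("object", ".value"),
               names_sep = "_") %>%
  # attach the study_id that matches each identifier/object combination
  left_join(lookup, by = c("identifier1", "identifier2", "object"))

print(long)
```

Each input row becomes two output rows (one per object), so the result here has four rows, each carrying its own study_id. Note that `names_sep = "_"` assumes the object names themselves contain no underscore.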
I have a vector obtained by flattening a square matrix, as below:
P <- as.vector(matrix(c(1, 2, 3, 4), nrow = 2))
What is the simplest way to arrange this vector into the three columns below (each column is shown here as a comma-separated line)?
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
I have been able to produce the first two columns with:
library(tidyverse)
df <- expand.grid(rep(c(1, 2, 3, 4),2))
df1 <- df %>% arrange_all()
df = expand.grid(a = df1[,1], b = df1[,1])
df[,c(2,1)]
The last column should be the following sequence, repeated as a whole until the column is filled:
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
Does this work? It builds one 16-element block of the last column:
as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))
[1] 1 2 1 2 1 2 1 2 3 4 3 4 3 4 3 4
paste(as.vector(apply(matrix(P, nrow = 2), 2, rep, 4)), collapse = ',')
[1] "1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4"
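To go from that single 16-element block to all three 64-element columns, one option is rep() with its each and times arguments. A minimal sketch, assuming P is the vector from the question; the repeat counts are hard-coded for this 2x2 case, and the names `block` and `out` are introduced only for illustration:

```r
P <- as.vector(matrix(c(1, 2, 3, 4), nrow = 2))

# One 16-element block of the last column: each matrix column repeated 4 times
block <- as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))

out <- data.frame(
  a = rep(P, each = 16),                 # 1 x16, 2 x16, 3 x16, 4 x16
  b = rep(rep(P, each = 2), times = 8),  # 1,1,2,2,3,3,4,4 repeated 8 times
  c = rep(block, times = 4)              # the 16-element block repeated 4 times
)

dim(out)  # 64 3
```

For a larger matrix the counts would have to be derived from its dimensions rather than typed in by hand.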
I am trying to use the values in one column as column indices to extract values from a data frame. My problem is similar to this topic on R-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c("x", "y", "x", "z"),
                 stringsAsFactors = FALSE)
However, instead of column names in choice, I have column index numbers, so my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3),
                 stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
  df[cbind(
    seq_len(nrow(df)),
    match(df$choice, colnames(df))
  )]
Instead of giving me output that looks like this:
#   x y choice newValue
# 1 1 5      1        1
# 2 2 6      2        6
# 3 3 7      1        3
# 4 4 8      3       NA
my newValue column comes back as all NAs:
#   x y choice newValue
# 1 1 5      1       NA
# 2 2 6      2       NA
# 3 3 7      1       NA
# 4 4 8      3       NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds the column numbers to extract, match is not needed here. (It returned all NAs because it compared the numbers in choice against the column names "x", "y" and "choice" and found no match.) However, the data frame also contains the choice column itself, which should not be used as a source of values, so any index outside the range of the value columns has to be turned into NA before subsetting the data frame.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
Data:
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))
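The approach above relies on choice being the last column, so that ncol(df) - 1 counts the value columns. A variation that names the value columns explicitly might look like the sketch below; `value_cols` and `idx` are hypothetical names, and the list of value columns is an assumption about which columns should be searched.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

# Assumption: only these columns may be picked, in this order
value_cols <- c("x", "y")

# Keep an index only if it points at one of the value columns, otherwise NA
idx <- ifelse(df$choice %in% seq_along(value_cols), df$choice, NA)

# Matrix indexing: one (row, column) pair per row; a pair containing NA yields NA
df$newValue <- as.matrix(df[value_cols])[cbind(seq_len(nrow(df)), idx)]

df$newValue  # 1 6 3 NA
```

Restricting the lookup to `df[value_cols]` also means that adding further columns to df later does not change which values can be selected.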
Below I split a master dataset of dimensions 12x5 into four data.tables, which I then want to merge back together. The data.tables share no row IDs but do share some column names. When I merge them, merge() does not treat the same-named columns as matching and creates a separate column for each one. The final merged data.table should be 12x5, but it comes out as 12x7. I thought that the all = TRUE argument to data.table's merge() would solve this.
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))
setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")
final <- merge(a, b, all = TRUE)
final <- merge(final, c, all = TRUE)
final <- merge(final, d, all = TRUE)
names(final)
dim(final) # outputs the correct number of rows, but too many columns
The problem is with the way you are calling the 'merge' function.
By default, the 'merge' function in the data.table package merges two data tables by the "shared key columns between them". Suppose you create the 'a' and 'b' data tables like this:
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")
where 'a' is going to be like:
id C1
1: 1 1
2: 2 2
3: 3 3
and 'b' is going to be like:
id C1 C2
1: 4 1 2
2: 5 2 3
3: 6 3 4
Now, let's first try your code:
merge(a, b, all = TRUE)
This is the result:
id C1.x C1.y C2
1: 1 1 NA NA
2: 2 2 NA NA
3: 3 3 NA NA
4: 4 NA 1 2
5: 5 NA 2 3
6: 6 NA 3 4
This happens because 'merge' uses only the 'id' field (the shared key of data tables 'a' and 'b') as the merge column. Every other column is carried over separately from each table, so the two 'C1' columns, which share a name but are not part of the key, are kept apart as 'C1.x' and 'C1.y'. Now let's try specifying which columns to merge on:
merge(a, b, by=c("id","C1"), all = TRUE)
Now the result is:
id C1 C2
1: 1 1 NA
2: 2 2 NA
3: 3 3 NA
4: 4 1 2
5: 5 2 3
6: 6 3 4
The same applies to the other merge calls you made. So try this:
final <- merge(a, b, by = c("id", "C1"), all = TRUE)
final <- merge(final, c, by = "id", all = TRUE) # here you don't necessarily need to specify by
final <- merge(final, d, by = c("id", "C3"), all = TRUE)
dim(final)
[1] 12 5
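Since the four tables have no row IDs in common, the merges above never actually match rows across tables; they only stack them. A different technique that produces the same shape is row-binding with data.table's rbindlist() and fill = TRUE, which aligns columns by name and fills the missing ones with NA. A minimal sketch, where the name `stacked` is introduced only for illustration:

```r
library(data.table)

a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))

# Stack the tables; fill = TRUE matches columns by name and pads gaps with NA
stacked <- rbindlist(list(a, b, c, d), fill = TRUE)
setkey(stacked, id)

dim(stacked)  # 12 5
```

This is only equivalent to the merge-based answer because no id appears in more than one table; if ids did overlap, rbindlist() would keep them as separate rows, whereas merge() would combine them.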