I have a tab-delimited text file with a series of timestamped data. I've read it into R using read.delim() and it gives me all the data as characters in a single column. Example:
df <- data.frame(c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
colnames(df) <- "col1"
df
I want to convert every n rows (in this case 4) into columns and stack the results, without using a for loop. Desired result:
col1 <- c("2017","2018","2018")
col2 <- c("A","X","X")
col3 <- c("B","Y","B")
col4 <- c("C","Z","C")
df2 <- data.frame(col1, col2, col3, col4)
df2
I created a for loop, but it can't handle the millions of rows in my df. Should I convert to a matrix? Would converting to a list help? I tried as.matrix(read.table()) and unlist() but without success.
You could use tidyr to reshape the data into the form you want. You will first need to mutate the data to identify which record each value belongs to and which column it should go in.
Assuming you know there are 4 values per record (n = 4), you could do something like the following with the help of the dplyr package.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
mutate(cols = rep(1:n, n()/n),
id = rep(1:(n()/n), each = n))
pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#> id cols1 cols2 cols3 cols4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 2017 A B C
#> 2 2 2018 X Y Z
#> 3 3 2018 X B C
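The two helper columns are plain index vectors built with rep(). As a minimal base R sketch of how they line up for n = 4 and 12 values (the variable name len is illustrative and not part of the answer above):

```r
n <- 4      # values per record
len <- 12   # total number of values (assumed to be a multiple of n)

# Column index: cycles 1..n once per record
cols <- rep(seq_len(n), times = len / n)
# Record index: each record number repeated n times
id <- rep(seq_len(len / n), each = n)

print(cols)
print(id)
```

Together, id and cols identify every position exactly once, which is what lets pivot_wider place each value in a unique cell.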
Or, in base R, you could use the split function on the vector and then use do.call to build the data frame:
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#> cols1 cols2 cols3 cols4
#> 1 2017 A B C
#> 2 2018 X Y Z
#> 3 2018 X B C
Created on 2022-02-01 by the reprex package (v2.0.1)
The easiest way would be to create a matrix with matrix(ncol=x, byrow=TRUE), then convert back to data.frame. Should be quite fast too.
df |>
unlist() |>
matrix(ncol=4, byrow = TRUE) |>
as.data.frame() |>
setNames(paste0('col', 1:4))
col1 col2 col3 col4
1 2017 A B C
2 2018 X Y Z
3 2018 X B C
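If the number of values per record varies between files, the same idea can be wrapped in a small function. This is only a sketch: rows_to_columns and its prefix argument are made-up names, and it assumes the input is a plain vector whose length is a multiple of n.

```r
# Hypothetical helper: reshape a vector into n columns, one record per row
rows_to_columns <- function(x, n, prefix = "col") {
  # Fail early: matrix() would otherwise recycle values with only a warning
  stopifnot(length(x) %% n == 0)
  out <- as.data.frame(matrix(x, ncol = n, byrow = TRUE),
                       stringsAsFactors = FALSE)
  setNames(out, paste0(prefix, seq_len(n)))
}

x <- c("2017", "A", "B", "C", "2018", "X", "Y", "Z", "2018", "X", "B", "C")
res <- rows_to_columns(x, 4)
print(res)
```

The length check is there because a truncated file would otherwise shift every later record by one position without stopping.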
I'm still learning R and was wondering if there is an elegant way of manipulating the df below to achieve df2.
I'm not sure if a loop is supposed to be used for this, but basically I want to extract the first non-NA "X_No" value if the "X_No" value is NA in the first row. This is perhaps best described through an example going from df to the desired df2.
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
I'm hoping for an elegant solution to this, as there are over 1000 columns similar to the example provided.
I've looked all over the web for a similar example that would reproduce the expected result, but to no avail.
Your help is very much appreciated.
Thank you
I don't know if I'd call it "elegant", but here is a potential solution:
library(tidyverse)
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
output <- df %>%
pivot_longer(everything(),
names_sep = "_",
names_to = c("Col", ".value")) %>%
drop_na() %>%
group_by(Col) %>%
slice_head(n = 1) %>%
ungroup() %>%
select(-Col)
df2
#> ID No
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
output
#> # A tibble: 5 × 2
#> ID No
#> <chr> <dbl>
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
all_equal(df2, output)
#> [1] TRUE
Created on 2023-02-08 with reprex v2.0.2
Using base R with max.col (assuming the columns alternate between ID and No):
ind <- max.col(!is.na(t(df[c(FALSE, TRUE)])), "first")
m1 <- cbind(seq_along(ind), ind)
data.frame(ID = t(df[c(TRUE, FALSE)])[m1], No = t(df[c(FALSE, TRUE)])[m1])
ID No
1 A 11
2 D 12
3 F 13
4 M 40
5 W 50
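One thing to watch with this approach: for a pair whose No column is entirely NA, max.col with "first" still returns column 1, so that pair would show up with an NA value. A small sketch of one way to drop such pairs, using a made-up two-pair data frame (names such as has_value and pick are illustrative only):

```r
# Toy data: pair B has no non-NA number at all
df <- data.frame(
  A_ID = c("A", "B"), A_No = c(NA, 7),
  B_ID = c("C", "D"), B_No = c(NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

id_mat <- t(df[c(TRUE, FALSE)])   # one row per pair, ID columns
no_mat <- t(df[c(FALSE, TRUE)])   # one row per pair, No columns
has_value <- !is.na(no_mat)

# Position of the first non-NA value in each row (1 when the row is all NA)
ind <- max.col(has_value + 0, ties.method = "first")
pick <- cbind(seq_along(ind), ind)

res <- data.frame(ID = id_mat[pick], No = no_mat[pick],
                  stringsAsFactors = FALSE)
# Keep only pairs that actually contain a non-NA value
res <- res[rowSums(has_value) > 0, , drop = FALSE]
print(res)
```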
Here is a data.table solution that should scale well to a (very) large dataset.
Functionally:
Split the data.frame into a list of chunks of columns, based on their names. All columns starting with A_ go to the first element, all columns starting with B_ to the second, and so on.
Then put these list elements on top of each other using data.table::rbindlist. Ignore the column names (this only works if every prefix, A_, B_, and so on, has the same number of columns).
Now get the first non-NA No value for each value in the first column (the ID).
Code:
library(data.table)
# split based on what comes before the underscore
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# bind together again
DT <- rbindlist(L, use.names = FALSE)
# extract the first value of the non-NA
DT[!is.na(A_No), .(No = A_No[1]), keyby = .(ID = A_ID)]
# ID No
# 1: A 11
# 2: D 12
# 3: F 13
# 4: G 14
# 5: I 15
# 6: M 40
# 7: P 20
# 8: W 50
# 9: X 32
#10: Y 48
#11: Z 40
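Note that the query above groups by ID, so it returns one row per ID value (11 rows) rather than one row per column pair as in the desired df2. A possible variant, sketched under the assumption that the prefix should be the grouping key, keeps the prefix through rbindlist's idcol argument (the column name grp is made up for illustration):

```r
library(data.table)

df <- data.frame(
  A_ID = c("A", "B", "I", "N"), A_No = c(11, NA, 15, NA),
  B_ID = c("B", "C", "D", "J"), B_No = c(NA, NA, 12, NA),
  C_ID = c("E", "F", "G", "P"), C_No = c(NA, 13, 14, 20),
  D_ID = c("J", "K", "L", "M"), D_No = c(NA, NA, NA, 40),
  E_ID = c("W", "X", "Y", "Z"), E_No = c(50, 32, 48, 40),
  stringsAsFactors = FALSE
)

# Split the columns by the prefix before the underscore
L <- split.default(df, f = sub("_.*$", "", names(df)))
# Stack the chunks and record each chunk's prefix in a column named "grp"
DT <- rbindlist(L, use.names = FALSE, idcol = "grp")
# Within each prefix, take the first row whose number is not NA
res <- DT[!is.na(A_No), .(ID = A_ID[1], No = A_No[1]), by = grp]
print(res)
```

Grouping with by keeps the prefixes in order of appearance, so the rows come out in the same order as the column pairs.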
Given the following tibbles:
df1<- tibble(A = c(1:10), B=sample(c(21:30)))
df2<-tibble(A = c(1,2,4,6,7))
I want to create df3 which contains all the rows in which df1$A is found in df2$A. I've tried
df3 <- df1 %>% filter(A == df2$A)
but this returns only 2 rows, because it is matching the rows, not searching for the values. My real data set is several thousand rows.
Thanks in advance!
library(tidyverse)
df1<- tibble(A = c(1:10), B=sample(c(21:30)))
df2<-tibble(A = c(1,2,4,6,7))
df1 %>%
filter(A %in% df2$A)
The proper way to do this is to use semi_join(), e.g.:
library(tidyverse)
set.seed(123)
df1 <- tibble(A = c(1:10), B = sample(c(21:30)))
df2 <- tibble(A = c(1, 2, 4, 6, 7))
df3 <- semi_join(df1, df2, by = "A")
df3
#> # A tibble: 5 x 2
#> A B
#> <int> <int>
#> 1 1 23
#> 2 2 30
#> 3 4 28
#> 4 6 29
#> 5 7 21
Created on 2020-05-06 by the reprex package (v0.3.0)
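To see why == returned only 2 rows, it may help to look at the comparison in base R. In this sketch (B is fixed rather than sampled so the result is reproducible), the shorter vector is recycled and compared position by position, whereas %in% tests membership:

```r
df1 <- data.frame(A = 1:10, B = 21:30)
df2 <- data.frame(A = c(1, 2, 4, 6, 7))

# == compares element by element; df2$A is recycled to length 10
recycled <- df1$A == df2$A
eq_rows <- df1[recycled, ]

# %in% asks, for each df1$A, whether it occurs anywhere in df2$A
in_rows <- df1[df1$A %in% df2$A, ]

print(eq_rows)
print(in_rows)
```

Because 10 is a multiple of 5, the recycling happens without any warning, which makes this mistake easy to miss.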
If I have a data set like this:
names <- c("Dave", "Ashley", "Drew")
score1 <- c(5, 1, 3)
opponent <- c("Drew", "Dave", "Ashley")
x <- cbind(names, score1, opponent)
x
y <- as.numeric(ifelse(x[, 3]==x[1, 1], x[1, "score1"], ifelse(
x[, 3]==x[2, 1], x[2, "score1"], ifelse(
x[, 3]==x[3, 1], x[3, "score1"], 1))))
y <- (y * score1)
x <- cbind(x, y)
x
Can I create a loop that adds a new column in which the number in the "score1" column is multiplied by the "score1" value from a different row, namely the opponent's row? For instance, the new value in row 1 would be 15: "Dave" has "Drew" in his row, so Dave's "score1" value of 5 is multiplied by Drew's "score1" value of 3. Is there a way to do this in a loop that would work for a hundred rows and a hundred columns? Currently, the only way I know to do it is to write a ton of "ifelse" statements like above.
You can perform a self join, although you'll have to be careful of duplicates if any names are repeated:
names <- c("Dave", "Ashley", "Drew")
score1 <- c(5, 1, 3)
opponent <- c("Drew", "Dave", "Ashley")
x <- data.frame(names,score1,opponent)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
new_df <- left_join(x,x[,-3],by=c("opponent"="names"))
new_df
#> names score1.x opponent score1.y
#> 1 Dave 5 Drew 3
#> 2 Ashley 1 Dave 5
#> 3 Drew 3 Ashley 1
new_df <- mutate(new_df,y=score1.x * score1.y)
new_df
#> names score1.x opponent score1.y y
#> 1 Dave 5 Drew 3 15
#> 2 Ashley 1 Dave 5 5
#> 3 Drew 3 Ashley 1 3
Created on 2019-05-07 by the reprex package (v0.2.1)
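For comparison, the same lookup can be sketched in base R with match(), which returns the row position of each opponent in the names column. This is an alternative to the join, not the method used above, and it assumes every name appears only once:

```r
x <- data.frame(
  names    = c("Dave", "Ashley", "Drew"),
  score1   = c(5, 1, 3),
  opponent = c("Drew", "Dave", "Ashley"),
  stringsAsFactors = FALSE
)

# Row index of each opponent within the names column
opp_row <- match(x$opponent, x$names)
# Own score times the opponent's score
x$y <- x$score1 * x$score1[opp_row]
print(x)
```

If an opponent is missing from names, match() gives NA and the product for that row is NA as well.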
I have two dataframes that I am trying to merge:
set.seed(123)
df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE))
df2 <- data.frame(
ID1 = letters[1:2],
ID2 = letters[3:4],
ID3 = letters[5:6],
V1 = c(23.32,21.24),
V2 = c(45.32,47.21)
)
Post merging, I want my df1 to contain the columns V1 and V2 along with ID. I have tried using merge, left_join and inner_join (from dplyr) but can't figure out how to use the by argument. The ID column from df1 could exist in any of the three columns (ID1, ID2 and ID3) of df2. How can I achieve this?
You have to reshape in long format first, then join:
library(dplyr)
library(tidyr)
df2 %>%
gather(IDnr, ID, 1:3) %>%
left_join(df1, ., by = 'ID')
# alternative:
df1 %>%
left_join(., df2 %>% gather(IDnr, ID, 1:3), by = 'ID')
The result:
ID V1 V2 IDnr
1 d 21.24 47.21 ID2
2 e 23.32 45.32 ID3
3 f 21.24 47.21 ID3
4 d 21.24 47.21 ID2
5 f 21.24 47.21 ID3
6 c 23.32 45.32 ID2
7 a 23.32 45.32 ID1
8 e 23.32 45.32 ID3
9 a 23.32 45.32 ID1
10 d 21.24 47.21 ID2
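In current versions of tidyr, gather() is superseded by pivot_longer(). A sketch of the same reshape-then-join idea with pivot_longer(), assuming tidyr 1.0.0 or later and using a shortened df1 with fixed IDs so the result is easy to check:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(ID = c("d", "e", "f", "a"), stringsAsFactors = FALSE)
df2 <- data.frame(
  ID1 = letters[1:2],
  ID2 = letters[3:4],
  ID3 = letters[5:6],
  V1 = c(23.32, 21.24),
  V2 = c(45.32, 47.21),
  stringsAsFactors = FALSE
)

# One row per (original row, ID column) combination
long <- pivot_longer(df2, cols = c(ID1, ID2, ID3),
                     names_to = "IDnr", values_to = "ID")
# Keep every row of df1 and attach the matching V1 and V2
res <- left_join(df1, long, by = "ID")
print(res)
```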
The by argument specifies the ID columns you want to join on when they are named differently in the left and right tables (if the name is the same in both, it is chosen automatically).
However, I have a way to simplify what you want to do. First, why not reshape df2 so that it has a single ID column (assuming each ID is unique across all 3 columns)? You can do this by creating 3 separate data frames and then combining them with bind_rows.
Now that it is reshaped, you can do a right join. df1 is on the right side of the join, so all the records in df1 are kept whether or not there is a match in df2 (unmatched rows get NA for V1 and V2).
With the sample data provided from df1, the results would be unexpected because each ID is repeated and not unique (so I have redefined df1 to have unique IDs only). If the IDs are not unique, you can group the results by ID and do an aggregation prior to doing the join.
set.seed(123)
#df1 <- data.frame(ID=sample(letters[1:6],10,replace=TRUE)) #This one has repeated IDs
df1 <- data.frame(ID=letters[1:6])
df2 <- data.frame(
ID1 = letters[1:2],
ID2 = letters[3:4],
ID3 = letters[5:6],
V1 = c(23.32,21.24),
V2 = c(45.32,47.21)
)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- bind_rows(df2 %>% select(ID=ID1, V1, V2),
df2 %>% select(ID=ID2, V1, V2),
df2 %>% select(ID=ID3, V1, V2)) %>%
right_join(df1)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Joining, by = "ID"
#> Warning: Column `ID` joining character vector and factor, coercing into
#> character vector
df
#> ID V1 V2
#> 1 a 23.32 45.32
#> 2 b 21.24 47.21
#> 3 c 23.32 45.32
#> 4 d 21.24 47.21
#> 5 e 23.32 45.32
#> 6 f 21.24 47.21
I'd like to merge two data frames where df2 overwrites any values that are NA or present in df1. The question "Merge data frames and overwrite values" provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options, but none seem to do this.
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4
I think what you want is to keep the values of df2 and only add the rows of df1 that are not present in df2, which is what anti_join does:
"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
This line gives the desired output (in a different order), but you should pay attention to the warning messages: when working with your dataset, be sure to read y as a character variable.
This is the idiom I now use because, in addition, it keeps columns that are not part of the update table. I use different names from the OP's, but the flavor is similar.
The one extra thing I do is create a variable for the keys used in the join, as I use them in a few spots. Otherwise, it does what is desired.
In itself it doesn't handle conditions such as "update this row only if a value is NA"; you should apply that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")
.base_table <- tribble(
~key1, ~key2, ~val1, ~val2,
"A", "a", 0, 0,
"A", "b", 0, 1,
"B", "a", 1, 0,
"B", "b", 1, 1)
.join_table <- tribble(
~key1, ~key2, ~val2,
"A", "b", 100,
"B", "a", 111)
# This works
df_result <- .base_table %>%
# Pull off rows from base table that match the join table
semi_join(.join_table, .keys) %>%
# Drop cols from base table that are in join table, except for the key columns
select(-matches(setdiff(names(.join_table), .keys))) %>%
# Left join on the join table columns
left_join(.join_table, .keys) %>%
# Remove the matching rows from the base table, and bind on the newly joined result from above.
bind_rows(.base_table %>% anti_join(.join_table, .keys))
df_result %>%
print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)
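Newer dplyr releases also ship a helper aimed at this task. As a sketch, assuming dplyr 1.0.0 or later, rows_update() overwrites the values in the first table with those from the second for matching keys, and rows_patch() from the same family only fills in NA values:

```r
library(dplyr)

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

# Overwrite x1 in df1 wherever the key y matches a row of df2
df3 <- rows_update(df1, df2, by = "y")
print(df3)
```

By default rows_update() raises an error if df2 contains a key that is missing from df1, so new rows have to be added separately.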