Suppose I start with a data frame:
ID Measurement1 Measurement2
1 45 104
2 34 87
3 23 99
4 56 67
...
Then I have a second data frame which is meant to be used to update records in the first:
ID Measurement1 Measurement2
2 10 11
4 21 22
How do I use R to end up with:
ID Measurement1 Measurement2
1 45 104
2 10 11
3 23 99
4 21 22
...
The data frames in reality are very large datasets.
We can use match to get the row index. Using that index to subset the rows, we replace the 2nd and 3rd columns of the first dataset with the corresponding columns of the second dataset.
ind <- match(df2$ID, df1$ID)  # row positions of df2's IDs within df1
df1[ind, 2:3] <- df2[2:3]     # overwrite those rows' measurement columns
df1
# ID Measurement1 Measurement2
#1 1 45 104
#2 2 10 11
#3 3 23 99
#4 4 21 22
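Note that match assumes every ID in df2 is also present in df1; if that is not guaranteed, a small guard (a sketch, not part of the original answer) avoids NA subscripts in the assignment:

ind <- match(df2$ID, df1$ID)
keep <- !is.na(ind)                     # skip df2 rows whose ID has no match in df1
df1[ind[keep], 2:3] <- df2[keep, 2:3]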
Or we can use data.table to join on the 'ID' column (after converting the first dataset to a data.table with setDT(df1)) and assign the 'Cols' the values of the corresponding 'iCols' from the second dataset.
library(data.table)#v1.9.6+
Cols <- names(df1)[-1]
iCols <- paste0('i.', Cols)
setDT(df1)[df2, (Cols) := mget(iCols), on= 'ID'][]
# ID Measurement1 Measurement2
#1: 1 45 104
#2: 2 10 11
#3: 3 23 99
#4: 4 21 22
data
df1 <- structure(list(ID = 1:4, Measurement1 = c(45L, 34L, 23L, 56L),
Measurement2 = c(104L, 87L, 99L, 67L)), .Names = c("ID",
"Measurement1", "Measurement2"), class = "data.frame",
row.names = c(NA, -4L))
df2 <- structure(list(ID = c(2L, 4L), Measurement1 = c(10L, 21L),
Measurement2 = c(11L,
22L)), .Names = c("ID", "Measurement1", "Measurement2"),
class = "data.frame", row.names = c(NA, -2L))
library(dplyr)
# drop the rows of df1 whose ID appears in df2, append the updated rows, and re-sort by ID
df1 %>%
  anti_join(df2, by = "ID") %>%
  bind_rows(df2) %>%
  arrange(ID)
dplyr 1.0.0 introduced a family of SQL-inspired functions for modifying rows. In this case you can now use rows_update():
library(dplyr)
df1 %>%
rows_update(df2, by = "ID")
ID Measurement1 Measurement2
1 1 45 104
2 2 10 11
3 3 23 99
4 4 21 22
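rows_update() expects every ID in df2 to already exist in df1. If df2 might also contain new IDs, rows_upsert() (a sketch under the same dplyr >= 1.0.0 assumption) updates the matching rows and appends the rest:

df1 %>%
  rows_upsert(df2, by = "ID")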
Related
I have a dataframe (DF) with 4 columns. How do I remove the whole row whenever column 4 is either 0 or NA? In the example below, only row 1 would be left.
Column 1 Column 2 Column 3 Column 4
11 24 234 2123
45 63 22 0
234 234 123 NA
Using dplyr:
library(dplyr)
df %>% filter(!is.na(Column.4) & Column.4 != 0)
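With the data from the question (reproduced in the Data block below), this should keep only the first row:

#   Column.1 Column.2 Column.3 Column.4
# 1       11       24      234     2123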
You can use logical vectors to subset your data:
df[!is.na(df[,4]) & (df[,4]!=0), ]
Example:
df = data.frame(x = rnorm(30), y = rnorm(30), z = rnorm(30), a = rep(c(0,1,NA),10))
df[!is.na(df[,4]) & (df[,4]!=0), ]
x y z a
2 -0.21772820 -0.5337648 -1.07579623 1
5 0.64536474 0.2011776 -0.12981424 1
8 2.36411372 0.0343823 2.03561701 1
11 1.09103526 -1.9287689 0.59511269 1
14 0.32482389 -0.5562136 -0.38943092 1
17 0.63621067 -1.6517097 -0.09804529 1
20 2.61892085 1.5575784 -0.50803567 1
23 0.07854647 1.1861483 -0.49798074 1
26 0.19561725 1.1036331 -0.66349688 1
29 0.22470875 -0.4192745 0.09153176 1
You can use sapply to loop through each row; this keeps only the rows that satisfy the underlying conditions:
df[sapply(1:nrow(df), function(i) all(!is.na(df[i,])) & all(df[i,] != 0)), ]
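Because looping over the rows with sapply can be slow on larger data, a vectorized sketch of the same condition (not part of the original answer) is:

# keep rows that contain no NA and no 0 in any column
df[rowSums(is.na(df) | df == 0) == 0, ]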
Data:
structure(list(Column.1 = c(11L, 45L, 234L), Column.2 = c(24L,
63L, 234L), Column.3 = c(234L, 22L, 123L), Column.4 = c(2123L,
0L, NA)), class = "data.frame", row.names = c(NA, -3L)) -> df
Output:
# Column.1 Column.2 Column.3 Column.4
# 1 11 24 234 2123
I have a df in R as follows:
ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20
Now I want to remove only the rows where both Score1 and Score2 are missing (i.e. the 3rd row).
You can filter it like this:
df <- read.table(header = TRUE, text = "ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20")
df[!(is.na(df$Score1) & is.na(df$Score2)), ]
# ID Age Score1 Score2
# 1 2 22 12 NA
# 2 3 19 11 22
# 4 1 21 NA 20
I.e. keep the rows where it is not (!) the case that Score1 is missing and (&) Score2 is missing.
Here are two versions with dplyr which can be extended to many columns with the prefix "Score".
Using filter_at
library(dplyr)
df %>% filter_at(vars(starts_with("Score")), any_vars(!is.na(.)))
# ID Age Score1 Score2
#1 2 22 12 NA
#2 3 19 11 22
#3 1 21 NA 20
and filter_if
df %>% filter_if(startsWith(names(.),"Score"), any_vars(!is.na(.)))
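In current dplyr, filter_at()/filter_if() are superseded; a sketch of the equivalent using if_any() (assuming dplyr >= 1.0.4, where if_any() was introduced):

df %>% filter(if_any(starts_with("Score"), ~ !is.na(.x)))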
A base R version with apply
df[apply(!is.na(df[startsWith(names(df),"Score")]), 1, any), ]
One option is rowSums
df1[ rowSums(is.na(df1[grep("Score", names(df1))])) < 2,]
Or another option with base R, where Reduce(`&`, ...) is TRUE only for rows in which every Score column is NA:
df1[!Reduce(`&`, lapply(df1[grep("Score", names(df1))], is.na)),]
data
df1 <- structure(list(ID = c(2L, 3L, 4L, 1L), Age = c(22L, 19L, 20L,
21L), Score1 = c(12L, 11L, NA, NA), Score2 = c(NA, 22L, NA, 20L
)), class = "data.frame", row.names = c(NA, -4L))
I have 2 data frames (DF1 & DF2) and I would like to join them together by a unique value called "acc_num". In DF2, payment was made twice by acc_num A and three times by B. The data frames are as follows.
DF1:
acc_num total_use sales
A 433 145
A NA 2
A NA 18
B 149 32
DF2:
acc payment
A 150
A 98
B 44
B 15
B 10
My desired output is:
acc_num total_use sales payment
A 433 145 150
A NA 2 98
A NA 18 NA
B 149 32 44
B NA NA 15
B NA NA 10
I've tried full_join and merge but the output was not as desired. I couldn't work this out as I'm still a beginner in R, and haven't found the solution to this.
An example of the code I used was:
test_full_join <- DF1 %>% full_join(DF2, by = c("acc_num" = "acc"))
The displayed output was:
acc_num total_use sales payment
A 433 145 150
A 433 145 98
A NA 2 150
A NA 2 98
A NA 18 150
A NA 18 98
B 149 32 44
B 149 32 15
B 149 32 10
This is contrary to my desired output. In the end, my goal is to get the total sum of total_use, sales and payment, and this output would give me the wrong interpretation for data visualization later on.
We may need to do a join by row_number() within each 'acc_num' group:
library(dplyr)
df1 %>%
  group_by(acc_num) %>%
  mutate(grpind = row_number()) %>%
  full_join(df2 %>%
              group_by(acc_num = acc) %>%
              mutate(grpind = row_number())) %>%
  select(acc_num, total_use, sales, payment)
# A tibble: 6 x 4
# Groups: acc_num [2]
# acc_num total_use sales payment
# <chr> <int> <int> <int>
#1 A 433 145 150
#2 A NA 2 98
#3 A NA 18 NA
#4 B 149 32 44
#5 B NA NA 15
#6 B NA NA 10
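Since the stated goal is the total sum of total_use, sales and payment, a follow-up sketch (assuming the joined result above is stored in an object, here called res) could be:

res %>%
  group_by(acc_num) %>%
  summarise(across(c(total_use, sales, payment), ~ sum(.x, na.rm = TRUE)))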
data
df1 <- structure(list(acc_num = c("A", "A", "A", "B"), total_use = c(433L,
NA, NA, 149L), sales = c(145L, 2L, 18L, 32L)), class = "data.frame",
row.names = c(NA,
-4L))
df2 <- structure(list(acc = c("A", "A", "B", "B", "B"), payment = c(150L,
98L, 44L, 15L, 10L)), class = "data.frame", row.names = c(NA,
-5L))
Suppose there is a list of data frames with unequal numbers of rows (all of them have 3 columns), like below:
> typicalList
[[1]]
col1 col2 col3
1 12 10 ABC
2 54 87 DEF
[[2]]
col1 col2 col3
1 64 9 GHI
2 59 6 JKL
3 43 4 PST
Is it possible to have a dataframe from the above list with a new column called newColumn that looks like below:
newColumn col1 col2 col3
1 12 10 ABC
1 54 87 DEF
2 64 9 GHI
2 59 6 JKL
2 43 4 PST
I used plyr::ldply(typicalList, rbind), but that just stacks all the rows, giving 5 independent records in the data frame. Is it possible to build the data frame shown above, where the newColumn field indicates that the first two records come from the first element of the list and the remaining three from the second? Is there a better way to do this in R?
Data
typicalList <- list(structure(list(col1 = c(12L, 54L), col2 = c(10L, 87L), col3 = c("ABC",
"DEF")), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c("1",
"2")), structure(list(col1 = c(64L, 59L, 43L), col2 = c(9L, 6L,
4L), col3 = c("GHI", "JKL", "PST")), .Names = c("col1", "col2",
"col3"), class = "data.frame", row.names = c("1", "2", "3")))
We can use rbindlist from data.table with the idcol argument
library(data.table)
rbindlist(typicalList, idcol = "newColumn")
# newColumn col1 col2 col3
#1: 1 12 10 ABC
#2: 1 54 87 DEF
#3: 2 64 9 GHI
#4: 2 59 6 JKL
#5: 2 43 4 PST
Or use bind_rows with .id from dplyr
library(dplyr)
bind_rows(typicalList, .id = "newColumn")
# newColumn col1 col2 col3
#1 1 12 10 ABC
#2 1 54 87 DEF
#3 2 64 9 GHI
#4 2 59 6 JKL
#5 2 43 4 PST
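A base R sketch (not in the original answers) that produces the same result with Map() and do.call():

# prepend the list index as newColumn to each data frame, then row-bind them
do.call(rbind, Map(cbind, newColumn = seq_along(typicalList), typicalList))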
I would like to split each row of a (numeric) data frame into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
And after splitting each row, we would get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
the "value_1" and "value_2" columns are split and each element is set to a new row. For example, value_1 = 22 and value_2 = 54 are set to a new row.
Here is one option with data.table. We convert the 'data.frame' to a 'data.table' while keeping the row names as a column (setDT(df1, keep.rownames = TRUE)). We then put columns 1:5 and columns 1, 6, 7 into a list, rbind the list elements with the fill = TRUE option so that columns missing from one of the datasets are filled with NA, order by the row-name column ('rn'), and assign (:=) that column to NULL.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
rbindlist(list(df1[, 1:5, with = FALSE],
               setnames(df1[, c(1, 6:7), with = FALSE], 2:3, c("ID", "X"))),
          fill = TRUE)[order(rn)][, rn := NULL][]
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A corresponding hadleyverse (dplyr) option for the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(rowname) %>%
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Replicate your matrix
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Walk the output rows two at a time: put the first four
# elements of the matching input row into the current row,
# and the two value columns into the next row, padded with NAs.
for(i in seq(1, nrow(output_df), 2)){
  j <- (i + 1) / 2                        # matching row of input_df
  output_df[i, ]   <- input_df[j, 1:4]
  output_df[i+1, ] <- c(input_df[j, 5:6], NA, NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
ID X Y Z
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[, 1:4]                                   # the ID/X/Y/Z block
data2 <- data[, setdiff(names(data), names(data1))]    # the value_1/value_2 block
names(data2) <- names(data1)[1:ncol(data2)]            # rename so rbind.fill lines the columns up
combined <- plyr::rbind.fill(data1, data2)             # stack; missing columns become NA
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"), ]               # interleave row i of data1 with row i of data2
Though why you would need to do this beats me.