I would like to splite each row of a data frame(numberic) into two rows. For example, part of the original data frame like this (nrow(original datafram) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
And after spliting each row, we can get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
the "value_1" and "value_2" columns are split and each element is set to a new row. For example, value_1 = 22 and value_2 = 54 are set to a new row.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' by creating a column of rownames (setDT(df1, keep.rownames = TRUE)). Subset the columns 1:5 and 1, 6, 7 in a list, rbind the list element with fill = TRUE option to return NA for corresponding columns that are not found in one of the datasets, order by the row number ('rn') and assign (:=) the row number column to 'NULL'.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
rbindlist(list(df1[, 1:5, with = FALSE], setnames(df1[, c(1, 6:7),
with = FALSE], 2:3, c("ID", "X"))), fill = TRUE)[order(rn)][, rn:= NULL][]
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A hadleyverse corresponding to the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(rowname) %>%
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Replicate your matrix
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each row in turn.
# Put the first four elements into the current
# row, and the next two into the current+1 row
# with two NAs attached.
for(i in seq(1, nrow(output_df), 2)){
output_df[i,] <- input_df[i, c(1:4)]
output_df[i+1,] <- c(input_df[i, c(5:6)],NA,NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
X1 X2 X3 X4
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[,1:4]
data2 <- setdiff(data,data1)
names(data2) <- names(data1)[1:ncol(data2)]
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.
Related
I have a dataframe (DF) with 4 columns. How do I make it so if column 4 is either a 0 or an NA, then remove the whole row? So in the example below only row 1 would be left.
Column 1 Column 2 Column 3 Column 4
11 24 234 2123
45 63 22 0
234 234 123 NA
using dplyr
library(dplyr)
df %>% filter(!is.na(Column.4) & Column.4 != 0)
You can use logical vectors to subset your data:
df[!is.na(df[,4]) & (df[,4]!=0), ]
Example:
df = data.frame(x = rnorm(30), y = rnorm(30), z = rnorm(30), a = rep(c(0,1,NA),10))
x y z a
2 -0.21772820 -0.5337648 -1.07579623 1
5 0.64536474 0.2011776 -0.12981424 1
8 2.36411372 0.0343823 2.03561701 1
11 1.09103526 -1.9287689 0.59511269 1
14 0.32482389 -0.5562136 -0.38943092 1
17 0.63621067 -1.6517097 -0.09804529 1
20 2.61892085 1.5575784 -0.50803567 1
23 0.07854647 1.1861483 -0.49798074 1
26 0.19561725 1.1036331 -0.66349688 1
29 0.22470875 -0.4192745 0.09153176 1
You can use sapply to loop thru each row and it will display the rows the rows that satisfy the underlying conditions:
df[sapply(1:nrow(df), function(i) all(!is.na(df[i,])) & all(df[i,] != 0)), ]
Data:
structure(list(Column.1 = c(11L, 45L, 234L), Column.2 = c(24L,
63L, 234L), Column.3 = c(234L, 22L, 123L), Column.4 = c(2123L,
0L, NA)), class = "data.frame", row.names = c(NA, -3L)) -> df
Output:
# Column.1 Column.2 Column.3 Column.4
# 1 11 24 234 2123
I have a data frame that is sorted based on one column(numeric column) to assign the rank. if this column value is zero then arrange the data frame based on another character column for those rows which have zero as a value in a numeric column.
But to give rank I have to consider var2 that is the reason I sorted based on var2, if there is any identical values in var2 for those rows I have to consider var3 to give rank. please see the data frame 2 and 3 rows, var2 values are identical in that case i have to consider var3 to give rank. In case var2 is zero i have to sort the var1 column(character column) in alphabetical order and give rank. if var2 is NA no rank. please refer the data frame given below.
Below, the data frame is sorted based on var2 column descending order, but var2 contains zero also if var2 is zero I have to sort the data frame based on var1 for the rows which are having zero in var2. I need sort by var1 for those rows which are having var2 as zero and followed by NA in alphabetical order of var1.
example:
# var1 var2 var3 rank
# 1 c 556 45 1
# 2 a 345 35 3
# 3 f 345 64 2
# 4 b 134 87 4
# 5 z 0 34 5
# 6 d 0 32 6
# 7 c 0 12 7
# 8 a 0 23 8
# 9 e NA
# 10 b NA
below is my code
df <- data.frame(var1=c("c","a","f","b","z","d", "c","a", "e", "b", "ad", "gf", "kg", "ts", "mp"), var2=c(134, NA,345, 200, 556,NA, 345, 200, 150, 0, 25,10,0,150,0), var3=c(65,'',45,34,68,'',73,12,35,23,34,56,56,78,123))
# To break the tie between var3 and var2
orderdf <- df[order(df$var2, df$var1, decreasing = TRUE), ]
#assigning rank
rankdf <- orderdf %>% mutate(rank = ifelse(is.na(var2),'', seq(1:nrow(orderdf))))
expected output is sort the var1 in alphabetical order if var2 value is zero(for those rows with var2 value is zero)
expected output:
# var1 var2 var3 rank
# 1 c 556 45 1
# 2 a 345 35 3
# 3 f 345 64 2
# 4 b 134 87 4
# 5 a 0 34 5
# 6 c 0 32 6
# 7 d 0 12 7
# 8 z 0 23 8
# 9 b NA
# 10 e NA
With dplyr you can use
df %>%
arrange(desc(var2), var1)
and afterwards you create the column rank
EDIT
The following code is a bit cumbersome but it gets the job done. Basically it orders the rows in which var2 is equal or different from zero separately, then combines the two ordered dataframes together and finally creates the rank column.
Data
df <- data.frame(
var1 = c("c","a","f","b","z","d", "c","a", "e", "z", "ad", "gf", "kg", "ts", "mp"),
var2 = c(134, NA,345, 200, 556,NA, 345, 200, 150, 0, 25,10,0,150,0),
var3 = as.numeric(c(65,'',45,34,68,'',73,12,35,23,34,56,56,78,123))
)
df
# var1 var2 var3
# 1 c 134 65
# 2 a NA NA
# 3 f 345 45
# 4 b 200 34
# 5 z 556 68
# 6 d NA NA
# 7 c 345 73
# 8 a 200 12
# 9 e 150 35
# 10 z 0 23
# 11 ad 25 34
# 12 gf 10 56
# 13 kg 0 56
# 14 ts 150 78
# 15 mp 0 123
Code
df %>%
# work on rows with var2 different from 0 or NA
filter(var2 != 0) %>%
arrange(desc(var2), desc(var3)) %>%
# merge with rows with var2 equal to 0 or NA
bind_rows(df %>% filter(var2 == 0 | is.na(var2)) %>% arrange(var1)) %>%
arrange(desc(var2)) %>%
# create the rank column only for the rows with var2 different from NA
mutate(
rank = seq_len(nrow(df)),
rank = ifelse(is.na(var2), NA, rank)
)
Output
# var1 var2 var3 rank
# 1 z 556 68 1
# 2 c 345 73 2
# 3 f 345 45 3
# 4 b 200 34 4
# 5 a 200 12 5
# 6 ts 150 78 6
# 7 e 150 35 7
# 8 c 134 65 8
# 9 ad 25 34 9
# 10 gf 10 56 10
# 11 kg 0 56 11
# 12 mp 0 123 12
# 13 z 0 23 13
# 14 a NA NA NA
# 15 d NA NA NA
Using only base R's order() function, sort first on descending order of var2 then ascending order of var1 to sort the data by passing the subsequent integer vector to square braces
df[order(-df$var2, df$var1), ]
Adding a rank column too is then just
df[order(-df$var2, df$var1), "rank"] <- 1:length(df$var1)
Using data.table
library(data.table)
setDT(df)[order(-var2, var1)][, rank := seq_len(.N)][]
data
df <- structure(list(var1 = structure(c(3L, 1L, 6L, 2L, 7L, 4L, 3L,
1L, 5L, 2L), .Label = c("a", "b", "c", "d", "e", "f", "z"), class = "factor"),
var2 = c(1456L, 456L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA)),
class = "data.frame", row.names = c(NA, -10L))
You can do it in base R, using order :
cols <- c('var1', 'var2')
remaining_cols <- setdiff(names(df), cols)
df1 <- df[cols]
cbind(transform(df1[with(df1, order(-var2, var1)), ],
rank = seq_len(nrow(df1))), df[remaining_cols])
# var1 var2 rank var3
#1 c 556 1 45
#2 a 345 2 35
#3 f 345 3 64
#4 b 134 4 87
#8 a 0 5 34
#7 c 0 6 32
#6 d 0 7 12
#5 z 0 8 23
#10 b NA 9 10
#9 e NA 10 11
data
df <- structure(list(var1 = structure(c(3L, 1L, 6L, 2L, 7L, 4L, 3L,
1L, 5L, 2L), .Label = c("a", "b", "c", "d", "e", "f", "z"), class = "factor"),
var2 = c(556L, 345L, 345L, 134L, 0L, 0L, 0L, 0L, NA, NA),
var3 = c(45L, 35L, 64L, 87L, 34L, 32L, 12L, 23L, 10L, 11L
)), class = "data.frame", row.names = c(NA, -10L))
Given the data:
df <- structure(list(cola = structure(c(5L, 9L, 6L, 2L, 7L, 10L, 3L,
8L, 1L, 4L), .Label = c("a", "b", "d", "g", "q", "r", "t", "w",
"x", "z"), class = "factor"), colb = c(156L, 8L, 6L, 100L, 49L,
31L, 189L, 77L, 154L, 171L), colc = c(0.207140279468149, 0.51990159181878,
0.402017514919862, 0.382948065642267, 0.488511856179684, 0.263168515404686,
0.38591041485779, 0.774066215148196, 0.763264901703224, 0.474355421960354
), cold = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
df
# cola colb colc cold
# 1 q 156 0.2071403 a
# 2 x 8 0.5199016 b
# 3 r 6 0.4020175 a
# 4 b 100 0.3829481 b
# 5 t 49 0.4885119 a
# 6 z 31 0.2631685 b
# 7 d 189 0.3859104 a
# 8 w 77 0.7740662 b
# 9 a 154 0.7632649 a
# 10 g 171 0.4743554 b
If the value in colc in a particular row is >= 0.5, I would like to replace the contents of all the other cells in that row with NA, except for the contents of cold for that row (which I would like to retain as it is).
I attempted this with dplyr::mutate_at() and base::ifelse(), and it works fine:
df %>% mutate_at(vars(-c(cold)), funs(ifelse(colc >= 0.5, NA, .)))
# cola colb colc cold
# 1 5 156 0.2071403 a
# 2 NA NA NA b
# 3 6 6 0.4020175 a
# 4 2 100 0.3829481 b
# 5 7 49 0.4885119 a
# 6 10 31 0.2631685 b
# 7 3 189 0.3859104 a
# 8 NA NA NA b
# 9 NA NA NA a
# 10 4 171 0.4743554 b
But I would like to do this with dplyr::case_when(), as I might have more than one replacement condition to fulfill (e.g., replace with "foo" if colc < 0.5 & colc >= 0.3. But case_when() does not appear to be playing nice:
df %>% mutate_at(vars(-c(cold)), funs(case_when(colc >= 0.5 ~ NA, TRUE ~ .)))
Error: must be a logical vector, not a factor object
Why is this happening and what can I do to fix it? I assume this is because I am trying to convert multiple columns with different data types to NA. I tried to look for a solution online, but I wasn't able to find one.
Edit: in specific, I would like to preserve the data types of the various columns as they are.
library(dplyr)
df %>%
mutate_at(vars(-c(cold)), ~ case_when(colc >= 0.5 ~ `is.na<-`(., TRUE), TRUE ~ .))
# cola colb colc cold
# 1 q 156 0.2071403 a
# 2 <NA> NA NA b
# 3 r 6 0.4020175 a
# 4 b 100 0.3829481 b
# 5 t 49 0.4885119 a
# 6 z 31 0.2631685 b
# 7 d 189 0.3859104 a
# 8 <NA> NA NA b
# 9 <NA> NA NA a
# 10 g 171 0.4743554 b
Description
When you use case_when to assign NA, you need to specify the type of NA, i.e. NA_integer_, NA_real_, NA_complex_ and NA_character_. However, mutate_at transforms multiple columns at the same time and these columns have different types, so you cannot apply one statement on all columns. Ideally, there might exist something like NA_guess to identify the type, but I don't find that so far. This method is a little tricky. I use is.na() to convert the input vector to NAs, and these NAs will be the same type as the input vector. For example:
x <- 1:5
is.na(x) <- TRUE ; x
# [1] NA NA NA NA NA
class(x)
# [1] "integer"
y <- letters[1:5]
is.na(y) <- TRUE ; y
# [1] NA NA NA NA NA
class(y)
# [1] "character"
Work around similar to #NelsonGon :
library(dplyr)
df %>%
mutate_all(as.character) %>%
mutate_at(vars(-c(cold)),
~case_when(colc >= 0.5 ~ NA_character_, # ifelse(is.numeric(.), NA_real_, NA_character_),
TRUE ~ .
)
) %>%
mutate(colb = as.numeric(colb),
colc = as.numeric(colc)
)
#> cola colb colc cold
#> 1 q 156 0.2071403 a
#> 2 <NA> <NA> NA b
#> 3 r 6 0.4020175 a
#> 4 b 100 0.3829481 b
#> 5 t 49 0.4885119 a
#> 6 z 31 0.2631685 b
#> 7 d 189 0.3859104 a
#> 8 <NA> <NA> NA b
#> 9 <NA> <NA> NA a
#> 10 g 171 0.4743554 b
I have the Following dataset where some entries (unique A) Don't have data in B and others that have sometimes.
A B
1 NA
2 NA
3 77
1 NA
2 81
I want to delete the entries that Always have NA and keep the rest
A B
2 NA
3 77
2 81
We can use ave grouped by A and remove the groups that has all NAs
df[!with(df, ave(is.na(B), A, FUN = all)), ]
# A B
#2 2 NA
#3 3 77
#5 2 81
Using the same logic with dplyr
library(dplyr)
df %>%
group_by(A) %>%
filter(!all(is.na(B)))
Assuming the input shown reproducibly in the Note at the end, for each group defined by A we return TRUE if any of its elements in B are not NA.
subset(DF, ave(!is.na(B), A, FUN = any))
Note
Lines <- "
A B
1 NA
2 NA
3 77
1 NA
2 81"
DF <- read.table(text = Lines, header = TRUE)
We can use data.table
library(data.table)
setDT(df1)[, .SD[any(!is.na(B))], A]
# A B
#1: 2 NA
#2: 2 81
#3: 3 77
data
df1 <- structure(list(A = c(1L, 2L, 3L, 1L, 2L), B = c(NA, NA, 77L,
NA, 81L)), class = "data.frame", row.names = c(NA, -5L))
I'm looking for a way to refer to a pevious row in my data frame that has one column value in common with the 'current row'. Basically, if this would be my data frame
A B D
1 10
4 5
6 6
3 25
1 40
I would want D(i) to contain the B value of the last row for which A has the same value as A(i). So for the last row that should be 10.
You could try this:
for(i in seq_len(nrow(dat))) {
try(dat$D[i] <- dat$B[tail(which(dat$A[1:i-1] == dat$A[i]),1)],silent=TRUE)
}
Results:
> dat
A B D
1 1 10 NA
2 4 5 NA
3 6 6 NA
4 3 25 NA
5 1 40 10
Data:
dat <- read.csv(text="A,B,D
1,10
4,5
6,6
3,25
1,40")
You may try
library(dplyr)
df1%>%
group_by(A) %>%
mutate(D=lag(B))
# A B D
#1 1 10 NA
#2 4 5 NA
#3 6 6 NA
#4 3 25 NA
#5 1 40 10
Or
library(data.table)#data.table_1.9.5
setDT(df1)[, D:=shift(B), A][]
data
df1 <- structure(list(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L,
25L, 40L)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -5L))