I have 4 indicator columns and an ID column. How can I make a new column based on these indicators, like this:
ID. col1. col2. col3. col4.
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
In each row exactly one of the columns is 1 and the others are 0. The new column should be 1 if col1 is 1, 2 if col2 is 1, 3 if col3 is 1, and 4 if col4 is 1.
So the output is
ID. col.
1 1
2 4
3 3
An option is max.col
cbind(df1[1], `col.` = max.col(df1[-1], "first"))
# ID. col.
#1 1 1
#2 2 4
#3 3 3
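To see what max.col() is doing here, a minimal illustration on a plain matrix: it returns, for each row, the index of the first column holding the row maximum.
max.col(rbind(c(1, 0, 0, 0), c(0, 0, 0, 1)), "first")
#[1] 1 4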
If a row has no 1s at all, add a logical condition so that the new column returns NA for that row:
df1[2, 5] <- 0
cbind(df1[1], `col.` = max.col(df1[-1], "first") * NA^ !rowSums(df1[-1] == 1))
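The NA^ part may look cryptic; it relies on NA^0 being 1 while NA^1 is NA, so rows without any 1 get multiplied by NA and everything else by 1:
NA^c(FALSE, TRUE)
#[1]  1 NA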
data
df1 <- structure(list(ID. = 1:3, col1. = c(1L, 0L, 0L), col2. = c(0L,
0L, 0L), col3. = c(0L, 0L, 1L), col4. = c(0L, 1L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))
My dataframe has columns and rows like this
Id Date   Col1 Col2 Col3 X1
1  1/1/22 NA   1    0    NA
1  1/1/22 0    0    1    6
2  5/7/21 0    1    0    NA
2  5/7/21 0    2    0    NA
I would like to drop rows within a duplicate group (same Id, same Date) where the value of column X1 is missing or empty. If both rows are missing X1 for that Id and Date, then don't drop either. Only when one is missing and the other is not should the missing row be dropped.
Expected output
Id Date   Col1 Col2 Col3 X1
1  1/1/22 0    0    1    6
2  5/7/21 0    1    0    NA
2  5/7/21 0    2    0    NA
I tried this
library(tidyr)
df %>%
group_by(Id, Date) %>%
drop_na(X1)
This drops all rows where X1 is NA or missing, and I am left with just one row, which is not what I want. Any suggestions much appreciated. Thanks.
We can create a condition in filter that keeps all the rows of a group if 'X1' is entirely missing, and otherwise removes the rows where 'X1' is missing:
library(dplyr)
df %>%
group_by(Id, Date) %>%
filter(if(all(is.na(X1))) TRUE else complete.cases(X1)) %>%
ungroup
-output
# A tibble: 3 × 6
Id Date Col1 Col2 Col3 X1
<int> <chr> <int> <int> <int> <int>
1 1 1/1/22 0 0 1 6
2 2 5/7/21 0 1 0 NA
3 2 5/7/21 0 2 0 NA
Or, without the if/else, combine | and & conditions:
df %>%
group_by(Id, Date) %>%
filter(any(complete.cases(X1)) & complete.cases(X1) |
all(is.na(X1))) %>%
ungroup
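Since & binds tighter than | in R, the condition reads as (some value in the group is non-missing AND this row is non-missing) OR (the whole group is missing). A quick sanity check on standalone vectors mimicking the two kinds of groups:
x <- c(NA, 6)      # group with one non-missing X1
any(complete.cases(x)) & complete.cases(x) | all(is.na(x))
#[1] FALSE  TRUE
x <- c(NA, NA)     # group where X1 is entirely missing
any(complete.cases(x)) & complete.cases(x) | all(is.na(x))
#[1] TRUE TRUE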
data
df <- structure(list(Id = c(1L, 1L, 2L, 2L), Date = c("1/1/22", "1/1/22",
"5/7/21", "5/7/21"), Col1 = c(NA, 0L, 0L, 0L), Col2 = c(1L, 0L,
1L, 2L), Col3 = c(0L, 1L, 0L, 0L), X1 = c(NA, 6L, NA, NA)),
class = "data.frame", row.names = c(NA,
-4L))
I want to replace multiple columns of a data frame with one column per group, while also changing the numbers. Example:
A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0
I want to collapse this data frame by its headers, meaning I only want one column "A" instead of 4 here and one column "B" instead of 3 here. The numbers should change with the following pattern: if an observation in group "A2" is a "1", it should become a "2"; if an observation in group "A3" is a "1", it should become a "3"; and so on. In the end each new column should contain the highest such number for that row (if I have three "1"s in my row within a group, they are all replaced by the number of the highest column in that group).
If the number is 0 then nothing changes. Here is the result I'm looking for:
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
How can I replace all of these groups with a single column each (one column for each group)?
So far I've tried a lot with the function unite(data = testdata, col = "A"), for example, but doing this manually would take too long. There has to be a better way, right?
Thanks in advance!
You can do:
dat <- read.table(header=TRUE, text=
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")
# for one row of the logical matrix: index of the last TRUE (the highest column with a 1), or 0 if none
myfu <- function(x) if (any(x)) max(which(x)) else 0
new <- data.frame(
A=apply(dat[, 1:4]==1, 1, myfu),
B=apply(dat[, 5:7]==1, 1, myfu))
new
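To see what myfu() returns, a quick check on two logical rows like those produced by dat[, 1:4] == 1:
myfu(c(TRUE, TRUE, FALSE, TRUE))    # 4: the last A-column with a 1
myfu(c(FALSE, FALSE, FALSE, FALSE)) # 0: no 1s in the group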
A more general solution:
new2 <- data.frame(
A=apply(dat[, grepl("^A", names(dat))]==1, 1, myfu),
B=apply(dat[, grepl("^B", names(dat))]==1, 1, myfu))
new2
You can try the code below, which splits the columns by their letter prefix and, within each group, takes the position of the last 1 in each row (or 0 if the row has no 1s):
dfout <- as.data.frame(
lapply(
split.default(df, gsub("\\d+$", "", names(df))),
function(v) max.col(v, ties.method = "last") * +(rowSums(v) >= 1)
)
)
such that
> dfout
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
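A note on how the pieces fit together: split.default() groups the columns of df by their letter prefix into a list of data frames, max.col(v, ties.method = "last") picks the right-most 1 per row within each group, and multiplying by +(rowSums(v) >= 1) zeroes out rows with no 1 at all. A quick look at the split, assuming the df from the Data block below:
names(split.default(df, gsub("\\d+$", "", names(df))))
#[1] "A" "B"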
Data
df <- structure(list(A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L,
0L, 0L), A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L
), B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L), B3 = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5"))
Assuming your data is in a data.frame called df1, this works in base R:
# multiply each column by its numeric suffix; t() makes the suffix vector line up with the old columns when recycled
df1 <- t(df1) * as.numeric(regmatches(colnames(df1), regexpr("\\d+$", colnames(df1))))
# split the transposed rows (the original columns) by their letter prefix
df1 <- split(as.data.frame(df1), sub("\\d+$", "", row.names(df1)))
# per group, take the maximum over each column (each column now corresponds to an original row)
df1 <- sapply(df1, apply, 2, max)
output:
> df1
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
I have a matrix of bundles and I would like to subset it based on the column sum (a budget) and the first value. If the first value of a column is 0 and I could add the value in and still be under the budget, I would like to drop that column.
For example, if my budget is 10 (column sum) and my matrix looks like this:
col1 col2 col3 col4
1 2 2 0 0
2 3 3 3 3
3 0 0 2 0
4 4 0 4 0
I would like the end matrix to look like this because the 0 in col4 row 1 could be included and the column sum would be under 10:
col1 col2 col3
1 2 2 0
2 3 3 3
3 0 0 2
4 4 0 4
My code is currently:
for (i in 1:ncol(df)) {
  if (df[1, i] == 0) {
    df <- df[, which(colSums(df) + 2 > 10)]
  }
}
The code is not working because it also removes column 2. I don't think it is considering the if statement when subsetting the matrix.
Thanks.
Similar to the solution by @akrun, but I think the following subset approach already does the job:
> df[,head(df,1)!=0 | colSums(df)+2>10]
col1 col2 col3
1 2 2 0
2 3 3 3
3 0 0 2
4 4 0 4
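The hard-coded 2 and 10 are the amount that could still be added in and the budget from the question; naming them keeps the same logic but may make the intent clearer (the names budget and add_in are illustrative):
budget <- 10
add_in <- 2   # the "+2" from the question's own code
df[, head(df, 1) != 0 | colSums(df) + add_in > budget]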
DATA
df <- structure(list(col1 = c(2L, 3L, 0L, 4L), col2 = c(2L, 3L, 0L,
0L), col3 = c(0L, 3L, 2L, 4L), col4 = c(0L, 3L, 0L, 0L)),
class = "data.frame", row.names = c("1",
"2", "3", "4"))
One option is to create the condition from colSums and the values in the first row, and use it to subset the columns. colSums would be more efficient than the loop.
bids <- 2
df1[which(!(df1[1,] == 0 & (colSums(df1) + bids) < 10))]
# col1 col2 col3
#1 2 2 0
#2 3 3 3
#3 0 0 2
#4 4 0 4
Or using the for loop
for(i in seq_along(df1)) if(df1[1, i] == 0 & sum(df1[[i]]) + bids < 10) df1[[i]] <- NULL
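One caveat: because seq_along(df1) is evaluated once, deleting a column other than the last one shifts later positions, so df1[1, i] can point at the wrong column or go out of range. Looping over the positions in reverse is a defensive variant of the same idea:
for(i in rev(seq_along(df1))) if(df1[1, i] == 0 && sum(df1[[i]]) + bids < 10) df1[[i]] <- NULL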
data
df1 <- structure(list(col1 = c(2L, 3L, 0L, 4L), col2 = c(2L, 3L, 0L,
0L), col3 = c(0L, 3L, 2L, 4L), col4 = c(0L, 3L, 0L, 0L)),
class = "data.frame", row.names = c("1",
"2", "3", "4"))
I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on and 0 indicates that the system was off.
I am trying to figure out a way to summarize the gaps between the on/off transitions of these systems.
For example,
for the first row, it stops working after 'I'
for the second row, it works from 'E' to 'G', again from 'I' to 'J', and at 'M', but is off otherwise.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
group_by(rowid) %>%
# This counts the number of times a new streak starts
mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
filter(val == 1) %>%
group_by(rowid, grp_num) %>%
summarise(num_1s = n(),
range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
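In current tidyr, gather() is superseded; the reshaping step could equally be written with pivot_longer() (a sketch; the rest of the pipeline is unchanged):
df %>%
  rowid_to_column() %>%
  pivot_longer(-rowid, names_to = "col", values_to = "val")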
An option with data.table. Convert the 'data.frame' to 'data.table' while keeping the row names as a row-number column (setDT with keep.rownames = TRUE), melt from 'wide' to 'long' format with the row-number column 'rn' as id.var, create a run-length-id (rleid) column on the 'value' column grouped by 'rn', subset the rows where 'value' is 1, summarise with the number of rows (.N) and the pasted range of 'variable' values, grouped by 'grp' and 'rn', then assign the column that is no longer needed to NULL and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
grp := rleid(value), rn][value == 1, .(NumberOfOnes = .N,
Range = paste(range(as.character(variable)), collapse="-")),
.(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
# for each row: run-length encode, keep the runs of 1s, and build the column-name range per run
do.call(rbind, apply(df1, 1, function(x) {
  rl <- rle(x)
  i1 <- rl$values == 1       # which runs are runs of 1s
  l1 <- rl$lengths[i1]       # their lengths = the number of 1s
  nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
                FUN = function(y) paste(range(y), collapse = "-"))[i1]
  data.frame(NumberOfOnes = l1, Range = nm1)
}))
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
I have data that looks like this
ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1
I want to replace all values with 'NA' if the ID occurs more than once in the dataframe. The final product should look like this
ID v1 v2
1 1 0
2 0 1
3 NA NA
3 NA NA
4 0 1
I could do this by hand, but I want R to detect all the duplicate cases (in this case two times ID '3') and replace the values with 'NA'.
Thanks for your help!
You could use duplicated() from either end, and then replace.
idx <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)
df[idx, -1] <- NA
which gives
ID v1 v2
1 1 1 0
2 2 0 1
3 3 NA NA
4 3 NA NA
5 4 0 1
This will also work if the duplicated IDs are not next to each other.
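For instance, with made-up IDs that repeat but are not adjacent:
ids <- c(3, 1, 2, 3, 4)
duplicated(ids) | duplicated(ids, fromLast = TRUE)
#[1]  TRUE FALSE FALSE  TRUE FALSE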
Data:
df <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
One more option:
df1[df1$ID %in% df1$ID[duplicated(df1$ID)], -1] <- NA
#> df1
# ID v1 v2
#1 1 1 0
#2 2 0 1
#3 3 NA NA
#4 3 NA NA
#5 4 0 1
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
Here is a base R method
# get list of repeated IDs
repeats <- rle(df$ID)$values[rle(df$ID)$lengths > 1]
# set the corresponding variables to NA
df[, -1] <- sapply(df[, -1], function(i) {i[df$ID %in% repeats] <- NA; i})
In the first line, we use rle to extract the repeated IDs. In the second, we use sapply to loop through the non-ID variables and set their values to NA for the rows whose ID repeats.
Note that this assumes that the data set is sorted by ID. This may be accomplished with the order function. (df <- df[order(df$ID),]).
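As a quick illustration, rle() on the ID column of the example data below returns the run lengths and values; lengths > 1 flags the repeated ID:
rle(df$ID)
#Run Length Encoding
#  lengths: int [1:4] 1 1 2 1
#  values : int [1:4] 1 2 3 4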
If the dataset is very large, you might break up the first function into two steps to avoid computing the rle twice:
dfRle <- rle(df$ID)
repeats <- dfRle$values[dfRle$lengths > 1]
data
df <- read.table(header=T, text="ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1")