How to replace Na using the values of another column? - r

I have 2 indicator columns:
licence age.6-17
Na 1
1 0
Na 0
0 1
How can I change Na to 1 if a person is more than 17 years old (that is, the second column is 0) and to 0 otherwise?
Desired output:
licence age.6-17
0 1
1 0
1 0
0 1

Using dplyr and ifelse:
yourdata %>% mutate(licence = ifelse(`age.6-17` == 0, 1, 0))
There is no need to change the nature of "Na" or the column name.
In addition, in case you need to replace only the "Na" cells (considering "Na" is a string here), keep the existing value everywhere else:
yourdata %>% mutate(licence = as.numeric(ifelse(licence == "Na", 1 - `age.6-17`, licence)))
If, however, it is <NA>, you would need is.na(licence) instead of licence == "Na".
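To make the string case concrete, here is a minimal base R sketch. It assumes the column really holds the literal string "Na"; the small data frame is made up to mirror the question. The idea is to turn "Na" into a real NA first and then fill only those cells.

```r
# Hypothetical data mirroring the question; "Na" is stored as a string
yourdata <- data.frame(licence = c("Na", "1", "Na", "0"),
                       `age.6-17` = c(1L, 0L, 0L, 1L),
                       check.names = FALSE, stringsAsFactors = FALSE)

lic <- yourdata$licence
lic[lic == "Na"] <- NA   # turn the string into a real missing value
lic <- as.integer(lic)   # "1"/"0" become integers, NA stays NA

# Fill only the missing cells: 1 if age.6-17 is 0, otherwise 0
yourdata$licence <- ifelse(is.na(lic), 1L - yourdata[["age.6-17"]], lic)
yourdata
```

Rows that already had a value keep it; only the former "Na" cells are derived from the age column.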

In base you can subset with is.na and then subtract the value of age.6.17 from 1.
x <- read.table(header=TRUE, na.strings="Na", text="licence age.6-17
Na 1
1 0
Na 0
0 1")
idx <- is.na(x$licence)
x$licence[idx] <- 1-x$age.6.17[idx]
x
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
Or, in case you want to ignore what is actually stored in column licence, you can use:
with(x, data.frame(licence=1-age.6.17, age.6.17))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1

Assuming your NAs are actual NA we can use case_when in dplyr and apply the conditions.
library(dplyr)
df %>%
  mutate(licence = case_when(is.na(licence) & age.6.17 == 0 ~ 1L,
                             is.na(licence) & age.6.17 == 1 ~ 0L,
                             TRUE ~ licence))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
data
df <- structure(list(licence = c(NA, 1L, NA, 0L), age.6.17 = c(1L,
0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -4L))
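As a possible shorter variant of the same idea, dplyr's coalesce() takes the first non-missing value from its arguments. The sketch below assumes real NA values and that the fill value is always 1 minus age.6.17, as in the question.

```r
library(dplyr)

df <- data.frame(licence = c(NA, 1L, NA, 0L),
                 age.6.17 = c(1L, 0L, 0L, 1L))

# Keep licence where present, otherwise use 1 - age.6.17
res <- df %>%
  mutate(licence = coalesce(licence, 1L - age.6.17))
res
```

Both arguments are integer vectors here, so coalesce() does not need any type conversion.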

Related

restart counting under conditions in R [duplicate]

This question already has answers here:
Create counter within consecutive runs of certain values
I have a flag column that contains continuous streams of 1s and 0s. I want to sum each stream of 1s. When a 0 is encountered, the summing should stop. For the next stream of 1s, the summing should start afresh.
I have tried cumsum(negread_flag == 1), but this continues to sum after the 0s.
negread_flag result
1 1
1 2
1 3
1 4
0 0
0 0
0 0
1 1
1 2
1 3
0 0
We can use rleid (run-length ID, which generates a new ID whenever adjacent elements differ) as a grouping variable, then take the sequence within each group and assign it to 'result' where 'negread_flag' is 1. Finally, remove the 'grp' column by assigning NULL to it.
library(data.table)
setDT(df1)[, grp := rleid(negread_flag)
][, result := 0
][negread_flag == 1,
result := seq_len(.N), grp][, grp := NULL][]
# negread_flag result
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 0 0
# 6: 0 0
# 7: 0 0
# 8: 1 1
# 9: 1 2
#10: 1 3
#11: 0 0
Or a similar idea with tidyverse: using rleid (from data.table) as the grouping variable, create 'result' by multiplying row_number() by 'negread_flag', so that values corresponding to 0 in 'negread_flag' become 0
library(tidyverse)
df1 %>%
  group_by(grp = data.table::rleid(negread_flag)) %>%
  mutate(result = row_number() * negread_flag) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 11 x 2
# negread_flag result
# <int> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 0 0
# 6 0 0
# 7 0 0
# 8 1 1
# 9 1 2
#10 1 3
#11 0 0
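Or using base R (initialising 'result' to 0 first, so the rows where the flag is 0 are not left as NA)
df1$result <- 0L
i1 <- df1$negread_flag != 0
df1$result[i1] <- with(rle(df1$negread_flag), sequence(lengths * values))
Or, as @markus commented (this computes the whole column at once, so no subsetting with i1 is needed)
df1$result <- sequence(rle(df1$negread_flag)$lengths) * df1$negread_flag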
data
df1 <- structure(list(negread_flag = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 0L)), row.names = c(NA, -11L), class = "data.frame")
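For readers without data.table, the run IDs that rleid produces can be rebuilt in base R. The following is only a sketch on a plain vector (the names flag, run_id and result are made up for illustration); it combines rle() with ave() to restart the counter in every run.

```r
flag <- c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L)

# One ID per run of identical adjacent values (what rleid would return)
runs <- rle(flag)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Count within each run, then zero out the runs where the flag is 0
result <- ave(flag, run_id, FUN = seq_along) * flag
data.frame(negread_flag = flag, result = result)
```

Multiplying by the flag is what resets the counter to 0 inside the runs of 0s.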

remove rows from dataframe based on value, ignoring NAs

I have a dataframe that I would like to delete rows from, based on a value in a specific column.
As an example, the dataframe appears something like this:
a b c d
1 1 2 3 0
2 4 NA 1 NA
3 6 4 0 1
4 NA 5 0 0
I would like to remove all rows with a value greater than 0 in column d. I have been trying to use the following code to do this:
df <- df[!df$d > 0, ]
but this appears to have the effect of deleting all the values in rows with an NA value in column d. I assumed that a na.rm = TRUE argument was needed, but I wasn't sure where to fit it into the code above.
Cheers,
Ant
We need to select the rows where d is not greater than 0 OR there is NA in d
df[with(df, !d > 0 | is.na(d)), ]
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0
Or we can also use subset
subset(df, !d > 0 | is.na(d))
or dplyr filter
library(dplyr)
df %>% filter(!d > 0 | is.na(d))
The !d > 0 part can also be rewritten as
subset(df, d <= 0 | is.na(d))
to get the same result.
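We can also construct the logical vector with complete.cases. Note that it has to be negated, because complete.cases(d) is TRUE for the non-missing values:
subset(df, !d > 0 | !complete.cases(d))
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0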
Or use subset with replace
subset(df, !replace(d, is.na(d), 0) > 0)
Or with tidyverse
library(tidyverse)
df %>%
filter(!replace_na(d, 0) >0)
which is slightly different from the method mentioned here or here
data
df <- structure(list(a = c(1L, 4L, 6L, NA), b = c(2L, NA, 4L, 5L),
c = c(3L, 1L, 0L, 0L), d = c(0L, NA, 1L, 0L)), class = "data.frame",
row.names = c("1", "2", "3", "4"))
If you add is.na(df$d) with |, all rows that have an NA in d will match. The condition !df$d > 0 then only decides for the values in d that are not NA. So I think you were looking for:
df[is.na(df$d) | !df$d > 0, ]
Whereas the version below won't include the rows that have an NA in column d; it keeps only the rows that match the condition !df$d > 0:
df[!is.na(df$d) & !df$d > 0, ]
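The underlying reason is that comparing NA with a number gives NA, and indexing rows with NA yields a row full of NAs instead of dropping or keeping the original row. A small sketch of that behaviour, using the same sample data (the names cond, bad, keep and good are only for illustration):

```r
df <- data.frame(a = c(1L, 4L, 6L, NA), b = c(2L, NA, 4L, 5L),
                 c = c(3L, 1L, 0L, 0L), d = c(0L, NA, 1L, 0L))

cond <- df$d > 0           # FALSE NA TRUE FALSE
bad  <- df[!cond, ]        # the NA index produces an all-NA row

# %in% never returns NA, so this treats a missing d as "not greater than 0"
keep <- !(cond %in% TRUE)  # TRUE TRUE FALSE TRUE
good <- df[keep, ]
good
```

Here good contains rows 1, 2 and 4, the same rows the is.na() versions above select.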

Rearrange and Sort

I have the following data
ID v1 v2 v3 v4 v5
1 1 3 6 4
2 4 2
3 3 1 8 5
4 2 5 3 1
Can I rearrange the data so that it automatically creates new columns and assigns a binary value (1 or 0) according to the values in each variable (v1 to v5)?
E.g., in the first row, I have the values 1, 3, 4 and 6. Can R automatically create 6 dummy variables and assign the values to the respective columns, as below:
ID dummy1 dummy2 dummy3 dummy4 dummy5 dummy6
1 1 0 1 1 0 1
To have something like this:
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 0 1 1 0 1 0 0
2 0 1 0 1 0 0 0 0
3 1 0 1 0 1 0 0 1
4 1 1 1 0 1 0 0 0
Thanks.
We can use base R to do this. Loop through the rows of the dataset (excluding the first column), get the sequence up to the max value in the row, check which of these values occur in the row, and convert the logical result to integer with as.integer. Then pad each list element with NAs at the end so that all have the same length, rbind them, and cbind the result with the first column
lst <- apply(df[-1], 1, function(x) as.integer(seq_len(max(x, na.rm = TRUE)) %in% x))
res <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
res[is.na(res)] <- 0
colnames(res)[-1] <- paste0('c', 1:8)
res
# ID c1 c2 c3 c4 c5 c6 c7 c8
#1 1 1 0 1 1 0 1 0 0
#2 2 0 1 0 1 0 0 0 0
#3 3 1 0 1 0 1 0 0 1
#4 4 1 1 1 0 1 0 0 0
In base R, you can use:
table(transform(cbind(mydf[1], stack(mydf[-1]))[1:2], values = factor(values, 1:8)))
## values
## ID 1 2 3 4 5 6 7 8
## 1 1 0 1 1 0 1 0 0
## 2 0 1 0 1 0 0 0 0
## 3 1 0 1 0 1 0 0 1
## 4 1 1 1 0 1 0 0 0
Note that you need to convert the stacked values to factor if you want the "7" to be included in the output. This applies to the "data.table" and "tidyverse" approaches as well.
Alternatively, you can try the following with "data.table":
library(data.table)
melt(as.data.table(mydf), "ID", na.rm = TRUE)[
, dcast(.SD, ID ~ factor(value, 1:8), fun = length, drop = FALSE)]
Or the following with the "tidyverse":
library(tidyverse)
mydf %>%
gather(var, val, -ID, na.rm = TRUE) %>%
select(-var) %>%
mutate(var = 1, val = factor(val, 1:8)) %>%
spread(val, var, fill = 0, drop = FALSE)
Sample data:
mydf <- structure(list(ID = 1:4, v1 = c(1L, 4L, 3L, 2L), v2 = c(3L, 2L,
1L, 5L), v3 = c(6L, NA, 8L, 3L), v4 = c(4L, NA, 5L, 1L), v5 = c(NA,
NA, NA, NA)), .Names = c("ID", "v1", "v2", "v3", "v4", "v5"), row.names = c(NA,
4L), class = "data.frame")
If automation is important, you can also use syntax like factor(value, sequence(max(value))) in the "data.table" approach or val = factor(val, sequence(max(val))) in the "tidyverse" approach.
Another base R answer with some similarities to akrun's is
# create matrix of values
myMat <- as.matrix(dat[-1])
# create result matrix of desired shape, filled with 0s
res <- matrix(0L, nrow(dat), ncol=max(myMat, na.rm=TRUE))
# use matrix indexing to fill in 1s
res[cbind(dat$ID, as.vector(myMat))] <- 1L
# convert to data.frame, add ID column, and provide variable names
setNames(data.frame(cbind(dat$ID, res)), c("ID", paste0("c", 1:8)))
which returns
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 1 0 1 1 0 1 0 0
2 2 0 1 0 1 0 0 0 0
3 3 1 0 1 0 1 0 0 1
4 4 1 1 1 0 1 0 0 0
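A compact variant of the row-wise idea is sketched below. It assumes the number of dummy columns should be the overall maximum found in v1 to v5, so every row gets the same length and no padding is needed; the names vals, k and dummies are made up for this example.

```r
mydf <- data.frame(ID = 1:4,
                   v1 = c(1L, 4L, 3L, 2L), v2 = c(3L, 2L, 1L, 5L),
                   v3 = c(6L, NA, 8L, 3L), v4 = c(4L, NA, 5L, 1L),
                   v5 = NA)

vals <- as.matrix(mydf[-1])
k <- max(vals, na.rm = TRUE)  # number of dummy columns

# For each row, flag which of 1..k occur among its values
dummies <- t(apply(vals, 1, function(x) as.integer(seq_len(k) %in% x)))
colnames(dummies) <- paste0("c", seq_len(k))

res <- cbind(mydf["ID"], dummies)
res
```

Because %in% only tests membership, a value that appears twice in a row still gives 1, not 2.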

Using table() to create 3 variable frequency table in R

I'm new to R and seeking some help. I understand the following problem is fairly simple and have looked for similar questions. None give quite the answer I'm looking for - any help would be appreciated.
The problem:
Producing a frequency table using the table() function for three variables with data in the format:
Var1 Var2 Var3
1 0 1 0
2 0 1 0
3 1 1 1
4 0 0 1
Where, 0 = "No" and 1 = "Yes"
And the final table is in the following format with variables and values labelled:
Var3
Yes No
Var1 Yes 1 0
No 1 2
Var2 Yes 1 2
No 1 0
What I have tried so far:
Using the following code I'm able to produce a 2 variable table, with labels for the variables but not for the values (ie. No and Yes).
table(data$Var1, data$Var3, dnn = c("Var1", "Var3"))
It looks like this:
Var3
Var1 0 1
0 2 1
1 0 1
In trying to label the row and column values (0 = No and 1= Yes) I understand row.names and responseName can be used, however the following attempt to label row names gives an all arguments must have the same length error.
> table(data$Var1, data$Var2, dnn = c("Var1", "Var2"), row.names = c("No", "Yes"))
I have also tried using ftable() however the shape of the table produced using code below is not correct resulting in incorrect frequencies for the problem. The issue with labeling row & col values persists.
> ftable(data$Var1, data$Var2, data$Var3, dnn = c("Var1", "Var2", "Var3"))
Var3 0 1
Var1 Var2
0 0 0 1
1 2 0
1 0 0 0
1 0 1
Any help in using table() to produce a table of the shape desired would be greatly appreciated.
You could try tabular from library(tables) after changing the labels as shown by @thelatemail
library(tables)
data[] <- lapply(data, factor, levels=1:0, labels=c('Yes', 'No'))
tabular(Var1+Var2~Var3, data=data)
# Var3
# Yes No
#Var1 Yes 1 0
# No 1 2
#Var2 Yes 1 2
# No 1 0
data
data <- structure(list(Var1 = c(0L, 0L, 1L, 0L), Var2 = c(1L, 1L, 1L,
0L), Var3 = c(0L, 0L, 1L, 1L)), .Names = c("Var1", "Var2", "Var3"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
The easiest way is probably to use the reshape2 package. First, you will need to convert your numeric information to factors so that it isn't treated as a number.
data$Var1 <- as.factor(data$Var1)
data$Var2 <- as.factor(data$Var2)
data$Var3 <- as.factor(data$Var3)
Then you can easily just apply table(data) to get the information you want. If you really want to transform it in the format you specified, then pull it as a data.frame and then transform it as required:
df <- as.data.frame(table(data))
library(reshape2)
dcast(df, Var1+Var2 ~ Var3, value.var = "Freq")
This gives the output:
Var1 Var2 0 1
1 0 0 0 1
2 0 1 2 0
3 1 0 0 0
4 1 1 0 1
EDIT: You can just use ftable on the data frame once it's all factors:
> ftable(data)
Var3 0 1
Var1 Var2
0 0 0 1
1 2 0
1 0 0 0
1 0 1
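If the result has to come from table() itself, one possible sketch is to build the two two-way tables separately and stack them with rbind. This assumes the labelled factors are created first (levels ordered Yes, No, as in the desired output); the names lab, t1, t2 and combined are only illustrative.

```r
data <- data.frame(Var1 = c(0L, 0L, 1L, 0L),
                   Var2 = c(1L, 1L, 1L, 0L),
                   Var3 = c(0L, 0L, 1L, 1L))

# Label the values: 1 = "Yes", 0 = "No"
lab <- lapply(data, factor, levels = c(1, 0), labels = c("Yes", "No"))

t1 <- table(Var1 = lab$Var1, Var3 = lab$Var3)
t2 <- table(Var2 = lab$Var2, Var3 = lab$Var3)

# Stack the two tables; the columns are the Var3 categories
combined <- rbind(t1, t2)
rownames(combined) <- paste(rep(c("Var1", "Var2"), each = 2), rownames(combined))
combined
```

The result is a plain matrix, so the "Var3" header above the columns is lost; the row names carry the variable and its level instead.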

Expanding a data.frame by replacing missing values with set of all possible values in R

I want to expand my dataset by replacing each incomplete row with the set of all possible rows. Does anyone have any suggestions for an efficient way to do this?
For example, suppose X and Z can each take values 0 or 1.
Input:
id y x z
1 1 0 0 NA
2 2 1 NA 0
3 3 0 1 1
4 4 1 NA NA
Output:
id y x z
1 1 0 0 0
2 1 0 0 1
3 2 1 0 0
4 2 1 1 0
5 3 0 1 1
6 4 1 0 0
7 4 1 0 1
8 4 1 1 0
9 4 1 1 1
At the moment I just work through the original dataset row by row:
for(i in 1:N){
  if(is.na(temp.dat$x[i]) & !is.na(temp.dat$z[i])){
    # only x is missing: two candidate rows
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,1)
  }else if(!is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
    # only z is missing: two candidate rows
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,4] <- c(0,1)
  }else{
    if(is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
      # both are missing: four candidate rows
      augment <- matrix(rep(temp.dat[i,],4),ncol=ncol(temp.dat),byrow=TRUE)
      augment[,3] <- c(0,0,1,1)
      augment[,4] <- c(0,1,0,1)
    }
  }
}
You could try the following steps:
1. Create an "indx" with the count of NAs in each row (rowSums(is.na(...))).
2. Use "indx" to expand the rows of the original dataset, repeating each row 2^indx times (df[rep(1:nrow(df), 2^indx), ]).
3. Loop over "indx" (sapply), use each count as the "times" argument in rep, and call expand.grid on the values 0, 1 to create "lst".
4. Split the expanded dataset, "df1", by "id" to create "lst2".
5. Use Map to replace the NA values in each element of "lst2" with the corresponding values in "lst".
6. rbind the list elements.
indx <- rowSums(is.na(df[-1]))
df1 <- df[rep(1:nrow(df), 2^indx),]
lst <- sapply(indx, function(x) expand.grid(rep(list(0:1), x)))
lst2 <- split(df1, df1$id)
res <- do.call(rbind,Map(function(x,y) {x[is.na(x)] <- as.matrix(y);x},
lst2, lst))
row.names(res) <- NULL
res
# id y x z
#1 1 0 0 0
#2 1 0 0 1
#3 2 1 0 0
#4 2 1 1 0
#5 3 0 1 1
#6 4 1 0 0
#7 4 1 1 0
#8 4 1 0 1
#9 4 1 1 1
data
df <- structure(list(id = 1:4, y = c(0L, 1L, 0L, 1L), x = c(0L, NA,
1L, NA), z = c(NA, 0L, 1L, NA)), .Names = c("id", "y", "x", "z"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
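The same expansion can also be written row by row with a small helper. This is only a sketch: expand_row is a hypothetical function name, and it assumes every missing variable can take exactly the values 0 and 1.

```r
df <- data.frame(id = 1:4, y = c(0L, 1L, 0L, 1L),
                 x = c(0L, NA, 1L, NA), z = c(NA, 0L, 1L, NA))

# Hypothetical helper: expand one row over all combinations of its NAs
expand_row <- function(row, values = 0:1) {
  miss <- colnames(row)[colSums(is.na(row)) > 0]
  if (length(miss) == 0) return(row)          # complete row: keep as is
  grid <- expand.grid(rep(list(values), length(miss)))
  out <- row[rep(1, nrow(grid)), ]            # one copy per combination
  out[miss] <- grid                           # fill in the candidate values
  out
}

res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) expand_row(df[i, ])))
rownames(res) <- NULL
res
```

With more variables or more than two possible values, only the values argument and the data change; the number of rows still grows exponentially in the number of NAs per row.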

Resources