Shifting a column down by one - r

Say I have a data.frame that looks like this
df <- data.frame(AAA = rep(c(NA,sample(1:10, 1)),5),
BBB = rep(c(NA,sample(1:10, 1)),5),
CCC = rep(c(sample(1:10, 1),NA),5))
> df
AAA BBB CCC
1 NA NA 10
2 3 7 NA
3 NA NA 10
4 3 7 NA
5 NA NA 10
6 3 7 NA
7 NA NA 10
8 3 7 NA
9 NA NA 10
10 3 7 NA
I want to shift column CCC down by one so that all the numbers align in a single row, and then delete the rows that contain no data (often every other row, BUT NOT ALWAYS: the pattern might vary through the data.frame).

Using dplyr
library(dplyr)
df %>%
mutate(CCC=lag(CCC)) %>%
na.omit()
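Note that na.omit() drops rows containing any NA; here that coincides with the all-NA rows after the shift. If you specifically want to drop only rows where every column is NA, a variant using dplyr's if_all() (available in dplyr >= 1.0.4) could be:
df %>%
  mutate(CCC = lag(CCC)) %>%
  filter(!if_all(everything(), is.na))   # keep rows with at least one non-NA value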
Or using data.table
library(data.table)
na.omit(setDT(df)[, CCC:=c(NA, CCC[-.N])])
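data.table also has a shift() helper for lags, so an equivalent way to write the above would be:
library(data.table)
na.omit(setDT(df)[, CCC := shift(CCC)])   # shift() lags CCC by one row, padding with NA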

Use a combination of the very efficient transform and na.omit functions
df <- na.omit(transform(df, CCC = c(NA, CCC[-nrow(df)])))

You can shift everything down by one with:
df$CCC <- c(NA, head(df$CCC, -1))   # drop the last value and prepend an NA
To delete rows with only NA values, do:
df <- df[apply(df, 1, function(x) !all(is.na(x))), ]
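A vectorized base R alternative to the apply() call, keeping rows that contain at least one non-NA value, would be:
df <- df[rowSums(!is.na(df)) > 0, ]   # drop rows where every column is NA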

Related

R - Merging rows with numerous NA values to another column

I would like to ask the R community for help with finding a solution for my data, where consecutive rows with numerous NA values are combined and put into a new column.
For example:
df <- data.frame(A = c(1,2,3,4,5,6), B = c(2, NA, NA, 5, NA, NA), C = c(1, 2, NA, 4, 5, NA), D = c(3, NA, 5, NA, NA, NA))
A B C D
1 1 2 1 3
2 2 NA 2 NA
3 3 NA NA 5
4 4 5 4 NA
5 5 NA 5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value and combine the entries from those consecutive rows into a single combined entry
Place the above combined entry in a new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
df %>%
unite("E", everything(), remove = FALSE, sep = " ") %>%
mutate(
E = if_else(
rowSums(across(!E, is.na)) > 1,
E,
""
),
new_row = cumsum(E == "")
) %>%
group_by(new_row) %>%
summarize(
across(A:D, first),
E = trimws(paste(E, collapse = " "))
) %>%
select(!new_row)
# A tibble: 2 × 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA

Tidying in R: how to collapse my binary columns into characters, based on vectors?

I am tidying my data in R, and want to turn multiple columns into 1, using a function iterating over the items of a vector. I was wondering whether you could help me out to:
fix a semantic error,
and make my code more efficient?
My data is based on a survey with 32 questions. Each question has multiple answers. Each answer is a column, with options 1 and NA.
For one question, a section of the dataset can be reproduced as follows:
XV2_1 <- c(1,NA,NA,NA)
XV2_2 <- c(NA,1,NA,NA)
XV2_3 <- c(NA,NA,NA,1)
XV2_4 <- c(NA,NA,1,NA)
id <- c(12,13,14,15)
dat <- data.frame(id,XV2_1, XV2_2, XV2_3,XV2_4)
> dat
id XV2_1 XV2_2 XV2_3 XV2_4
1 12 1 NA NA NA
2 13 NA 1 NA NA
3 14 NA NA NA 1
4 15 NA NA 1 NA
This is the data I would like to have:
question_2_answers <- c("Yellow","Blue","Green","Orange") #this is a vector based on the answers of the questionnaire
collapsed <- c("Yellow","Blue","Green","Orange")
collapsed_dataframe <- data.frame(id, X2 = collapsed)
>collapsed_dataframe
id X2
1 12 Yellow
2 13 Blue
3 14 Green
4 15 Orange
So far, I tried a sequence of "ifelse's" combined with mutate:
library(tidyverse)
question_2_answers <- c("Yellow","Blue","Green","Orange") #this is a vector based on the answers of the questionnaire
dat %>%
mutate(
Colour = tidy_Q2(question_2_answers,XV2_1,XV2_2,XV2_3,XV2_4)
)
tidy_Q2 <- function(a,b,c,d,e) {
ifelse(b == 1, a[1],ifelse(
c==1,a[2],ifelse(
d==1,a[3],a[4])))
}
However, my output is not as expected:
id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12 1 NA NA NA Yellow
2 13 NA 1 NA NA <NA>
3 14 NA NA NA 1 <NA>
4 15 NA NA 1 NA <NA>
I would have liked it to be as follows:
id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12 1 NA NA NA Yellow
2 13 NA 1 NA NA Blue
3 14 NA NA NA 1 Green
4 15 NA NA 1 NA Orange
Does anyone know a way to remove the error?
Another question I'd like to ask is whether my code can be more efficient. I have 32 survey questions in store after this, and I'd like to automate the process as much as possible. Notable things to take into account:
not all survey questions have the same number of options (e.g. question 2 has 4 options and therefore 4 columns, whilst question 10 has 8 options and 8 columns)
some values are strings, instead of 1 or NA
Always happy to learn,
Best,
Maria
This is a kind of wide-to-long conversion which we can do with tidyr::gather:
First, we make the colors the column names of the appropriate rows:
# Replace column names (except for the `id` column) with color values
colnames(dat)[-1] <- c("Yellow","Blue","Orange","Green")
dat
id Yellow Blue Orange Green
1 12 1 NA NA NA
2 13 NA 1 NA NA
3 14 NA NA NA 1
4 15 NA NA 1 NA
Then, we gather the non-id columns and drop the NA values:
library(tidyverse)
dat %>%
gather(X2, val, -id) %>% # Gather color cols from wide to long format
filter(!is.na(val)) %>% # Drop rows with NA values
select(-val) # Remove the unnecessary `val` column
id X2
1 12 Yellow
2 13 Blue
3 15 Orange
4 14 Green
This will work with any number of columns (you just need to specify all columns you don't want to gather) and keeps rows with non-NA values. If you want other conditions to exclude a row (for example, if 0 or 'unknown' should count as a non-answer, or only 'correct' counts as an answer) then you should add those conditions to the filter statement.
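For instance, if 0 or 'unknown' should also be treated as non-answers (as mentioned above), the filter could be extended along these lines:
dat %>%
  gather(X2, val, -id) %>%                              # wide to long, as above
  filter(!is.na(val), val != 0, val != "unknown") %>%   # drop NAs and other non-answers
  select(-val)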
One option in base R is max.col: find the column index of the first non-NA value in each row, use that index to look up the corresponding column names, and create a two-column data.frame by cbinding the result with the first column.
i1 <- max.col(!is.na(dat[-1]), 'first')
cbind(dat['id'], Colour = names(dat)[-1][i1])
# id Colour
#1 12 Yellow
#2 13 Blue
#3 14 Green
#4 15 Orange
data
dat <- structure(list(id = c(12, 13, 14, 15), Yellow = c(1, NA, NA,
NA), Blue = c(NA, 1, NA, NA), Orange = c(NA, NA, NA, 1), Green = c(NA,
NA, 1, NA)), class = "data.frame", row.names = c(NA, -4L))

Data.table: rbind a list of data tables with unequal columns [duplicate]

This question already has answers here:
rbindlist data.tables with different number of columns
(1 answer)
Rbind with new columns and data.table
(5 answers)
I have a list of data tables that are of unequal lengths. Some of the data tables have 35 columns and others have 36.
I have this line of code, but it generates an error
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify that so that it works properly?
Here's a minimal example of what you are trying to do.
No need to use any other package to do this. Just set fill=TRUE in rbindlist.
You can do this:
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
df3 <- rbindlist(list(df1, df2), fill = TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
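Applied to the list in your question, that should presumably just be:
model_dat <- rbindlist(lst, fill = TRUE)   # fills the missing columns with NA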
If I understood your question correctly, there are essentially two options for appending your data tables.
Option A: Drop the extra variable from the datasets that have it
table$column_Name <- NULL
Option B: Create the variable, filled with NA, in each incomplete dataset
full_data.lst[[i]]$column_Name <- NA   # for each incomplete table i in the list
and then rbind them.
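A minimal sketch of Option B over the whole list (column_Name is just a placeholder for whichever column is missing):
library(data.table)
for (i in seq_along(full_data.lst)) {
  if (!"column_Name" %in% names(full_data.lst[[i]])) {
    full_data.lst[[i]]$column_Name <- NA   # add the missing column, filled with NA
  }
}
model_dat <- rbindlist(full_data.lst)      # all tables now have the same columns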
Try to use rbind.fill from package plyr:
Input data: 3 data frames with different numbers of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

Find all indices of duplicates and write them in new columns

I have a data.frame with a single column, a vector of strings.
These strings have duplicate values.
I want to find the character strings that have duplicates in this vector and write their index of position in a new column.
So for example consider I have:
DT <- data.frame(string = c("A","B","C","D","E","F","A","C","F","Z","A"))
I want to get:
string match1 match2 match3 matchx....
A 1 7 11
B 2 NA NA
C 3 8 NA
D 4 NA NA
E 5 NA NA
F 6 9 NA
A 1 7 11
C 3 8 NA
F 6 9 NA
Z 10 NA NA
A 1 7 11
The actual vector is much longer than in this example, and I do not know the maximum number of columns I will need.
What would be the most effective way to do this?
I know that there is the duplicated() function, but I am not exactly sure how to use it to get the result I want here.
Many thanks!
Here's one way of doing this. I'm sure a data.table one liner follows.
DT<- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
# find matches
rbf <- sapply(DT$string, FUN = function(x, DT) which(DT %in% x), DT = DT$string)
# fill in NAs to have a pretty matrix
out <- sapply(rbf, FUN = function(x, mx) c(x, rep(NA, length.out = mx - length(x))), max(sapply(rbf, length)))
# bind it to the original data
cbind(DT, t(out))
string 1 2 3
1 A 1 7 11
2 B 2 NA NA
3 C 3 8 NA
4 D 4 NA NA
5 E 5 NA NA
6 F 6 9 NA
7 A 1 7 11
8 C 3 8 NA
9 F 6 9 NA
10 Z 10 NA NA
11 A 1 7 11
Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and the row index (.I), then dcast to 'wide' format and join with the original dataset on the 'string' column:
library(data.table)
dcast(setDT(DT)[, .(seq_len(.N),.I), string],string ~ paste0("match", V1))[DT, on = "string"]
# string match1 match2 match3
# 1: A 1 7 11
# 2: B 2 NA NA
# 3: C 3 8 NA
# 4: D 4 NA NA
# 5: E 5 NA NA
# 6: F 6 9 NA
# 7: A 1 7 11
# 8: C 3 8 NA
# 9: F 6 9 NA
#10: Z 10 NA NA
#11: A 1 7 11
Or another option would be to split the sequence of row indices by 'string', pad the shorter list elements with NA, and merge the result with the original dataset (using base R methods):
lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
by.x = "string", by.y = "row.names")
data
DT<- data.frame(string=c("A","B","C","D","E","F","A","C",
"F","Z","A"), stringsAsFactors=FALSE)
And here's one that uses tidyverse tools ( not quite a one-liner ;) ):
library( tidyverse )
DT %>% group_by( string ) %>%
do( idx = which(DT$string == unique(.$string)) ) %>%
ungroup %>% unnest %>% group_by( string ) %>%
mutate( m = stringr::str_c( "match", 1:n() ) ) %>%
spread( m, idx )

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace NAs in column v1 with the previous non-NA value if the next non-NA value matches the previous one. That being said, I want the result to be like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you can see, rows 2 and 3 are replaced with the number 1 because rows 1 and 4 have an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking this a lot but so far no luck. Thanks
Here is an idea using zoo package. We basically fill NAs in both directions and set NA the values that are not equal between those directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df1 %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution; the logic is almost the same as Sotos's:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I could use the na.locf function from the zoo package to do this. Basically, I replace each NA with the latest previous non-NA value and store the result in one column; then, using the same function with fromLast = TRUE, NAs are replaced with the next non-NA value and stored in another column. I then compare these two columns row by row, and wherever the results do not match I set the value back to NA.
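A minimal sketch of that description, assuming df holds the example data (na.rm = FALSE keeps leading and trailing NAs in place so the lengths stay equal):
library(zoo)
down <- na.locf(df$v1, na.rm = FALSE)                   # fill with the previous non-NA value
up   <- na.locf(df$v1, fromLast = TRUE, na.rm = FALSE)  # fill with the next non-NA value
df$v1 <- ifelse(!is.na(down) & !is.na(up) & down == up, down, df$v1)   # keep a fill only where both directions agree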
