Replace NA in row with value in adjacent row "ROW" not column [duplicate] - r

Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 = c('c1','c2',NA,NA,'c3',NA,'c4')
V2 = c('a','b','c','d','e','f','g')
df <- data.frame(V1, V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is sample data.
In the real data, the rows where V1 is NA do not follow a regular pattern.
I can't work out how to do this.

You could make use of zoo::na.locf for this. It takes the most recent non-NA value and fills in all the NA values that follow it:
library(dplyr)
library(zoo)
df %>%
mutate(V1 = zoo::na.locf(V1)) %>%
group_by(V1) %>%
summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g

A base R option using na.omit + cumsum + aggregate
aggregate(
V2 ~ .,
transform(
df,
V1 = na.omit(V1)[cumsum(!is.na(V1))]
), c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
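The index trick in the base R answer can be unpacked step by step. The following is a minimal sketch on the question's sample data, assuming the first value of V1 is not NA. The intermediate names idx and filled are illustrative only, and paste with collapse = " " is swapped in for c to get the space-separated strings shown in the expected output.

```r
# Sample data from the question
df <- data.frame(
  V1 = c("c1", "c2", NA, NA, "c3", NA, "c4"),
  V2 = c("a", "b", "c", "d", "e", "f", "g"),
  stringsAsFactors = FALSE
)

# cumsum() over the non-NA flags numbers each run of rows: 1 2 2 2 3 3 4
idx <- cumsum(!is.na(df$V1))

# Indexing the non-NA values by the run number carries each value forward
filled <- df$V1[!is.na(df$V1)][idx]

# Collapse V2 within each filled group into one space-separated string
res <- aggregate(V2 ~ V1, data.frame(V1 = filled, V2 = df$V2),
                 paste, collapse = " ")
print(res)
```

If V1 could start with NA, idx would begin at 0 and those leading rows would be silently dropped by the indexing step, so that case would need separate handling.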

You can fill the NA with the previous non-NA values and summarise the data.
library(dplyr)
library(tidyr)
df %>%
fill(V1) %>%
group_by(V1) %>%
summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g

Related

Group values in rows according into similar columns

I had a column with multiple values inside it..
Like...
ColumnX1
A,D,C,B,F,E,G
F,A,B,E,G,C
C,D,G,F,A,T
I split the data with
Species_Data2 <- data.frame(str_split_fixed(Species_Data$Other.Anopheline.species, ",", 21))
But the resulting data frame looks like this:
X1 X2 X3 X4 X5 X6 X7
A D C B F E G
F A B E G NA C
C D G F A T NA
I wanted to make a dataframe like:
X1 X2 X3 X4 X5 X6 X7 X8
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
Then I want to use the letters as the column names:
Colnames
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
I tried sorting the values, but that does not work well.
It comes up with O values, though.
If I understand correctly, the OP wants to rearrange the data so that there is a separate column for each letter. If a letter is present in a row, then the letter appears in the appropriate column/row of the reshaped data. NA indicates that a letter is missing in a row. In addition, the letter columns should be arranged in alphabetical order.
1. dplyr/tidyr approach
If we start with the data.frame resulting from OP's call to stringr::str_split_fixed(), we need to reshape the split data from wide to long format, remove empty entries, order the rows so that the columns appear in letter order, and reshape to wide format again. For reshaping, a row id is required. To achieve the desired output, pivot_wider() has to be called with the names_from = value parameter:
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
filter(value != "") %>%
arrange(as.character(value)) %>%
pivot_wider(id_cols = rn, names_from = value)
rn A B C D E F G T
<int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 A B C D E F G NA
2 2 A B C NA E F G NA
3 3 A NA C D NA F G T
2. data.table approach
If we start from the original, unsplit data, there is a much more concise variant which uses data.table's dcast() for reshaping:
library(data.table)
setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
If required, the additional row id column can be removed in both approaches.
Data
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
"F,A,B,E,G,C",
"C,D,G,F,A,T")
)
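For comparison, the same reshaping can be sketched in base R without any packages. This is not the method used in the answer above, just an illustration of the idea (one column per letter, NA where the letter is absent). The names parts, letters_all and wide are made up for this sketch.

```r
# Same sample data as above
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
                              "F,A,B,E,G,C",
                              "C,D,G,F,A,T"),
                 stringsAsFactors = FALSE)

# Split each row into its letters
parts <- strsplit(DF$ColumnX1, ",", fixed = TRUE)

# All letters that occur anywhere, in alphabetical order
letters_all <- sort(unique(unlist(parts)))

# For each row, keep a letter where it is present and NA where it is not
wide <- t(sapply(parts, function(p) {
  ifelse(letters_all %in% p, letters_all, NA_character_)
}))
colnames(wide) <- letters_all
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
print(wide)
```

Because the test is a membership check with %in%, a letter that occurs more than once in a row still yields a single entry.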
EDIT: Duplicate values
In a comment, the OP has disclosed that the production dataset contains duplicate values.
In case of duplicate values, dcast() uses the length() function by default to aggregate the data.
With a modified dataset DF2 which contains duplicate values in rows 1 and 2, the original data.table approach returns:
library(data.table)
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 1 1 2 1 1 1 1 0
2: 2 1 1 1 0 1 2 1 0
3: 3 1 0 1 1 0 1 1 1
Here, the number of duplicate letters is shown.
The expected behaviour can be restored by removing the duplicate values before reshaping by using unique():
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][
, dcast(unique(.SD), nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
Also the dplyr/tidyr approach needs to be modified by specifying an appropriate aggregation function in the call to pivot_wider():
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
filter(value != "") %>%
arrange(as.character(value)) %>%
pivot_wider(id_cols = rn, names_from = value, values_fn = list(value = unique))
Data with duplicate values
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
"F,A,B,E,G,C,F",
"C,D,G,F,A,T")
)

Looking up a particular column in R depending on another column

In an experiment, people had four candidates to choose from; sometimes they're male, other times they're female. In the below dataframe, C1 means Candidate 1, C2 means Candidate 2, and so on. F denotes female while M denotes male. A response of 1 indicates the person chose C1, a response of 2 indicates the person chose C2, and so on.
C1 C2 C3 C4 response
F F M M 2
M M F M 1
I want a new column "ChooseFemale" which equals to 1 if the candidate chose a female candidate, and zero otherwise. So the first row should have ChooseFemale equal to 1, while the second row should have ChooseFemale equal to zero.
This would require me to look up a certain column depending on the value of "response" column.
How can I do this?
Another base R solution:
x <- df[["response"]]
df$ChooseFemale <- as.integer(df[cbind(seq_along(x), x)] == "F")
C1 C2 C3 C4 response ChooseFemale
1 F F M M 2 1
2 M M F M 1 0
Data:
Lines <- "C1 C2 C3 C4 response
F F M M 2
M M F M 1"
df <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)
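The one-liner above relies on matrix indexing: a two-column matrix of (row, column) pairs picks exactly one cell per row. A minimal sketch of that step in isolation, assuming the candidate columns are named C1 to C4 as in the question; the names cand, idx and picked are illustrative only.

```r
df <- data.frame(C1 = c("F", "M"), C2 = c("F", "M"),
                 C3 = c("M", "F"), C4 = c("M", "M"),
                 response = c(2, 1),
                 stringsAsFactors = FALSE)

# Character matrix holding only the candidate columns
cand <- as.matrix(df[paste0("C", 1:4)])

# One (row, column) pair per respondent: row i, column response[i]
idx <- cbind(seq_len(nrow(df)), df$response)

# Indexing with a two-column matrix returns one cell per pair
picked <- cand[idx]

df$ChooseFemale <- as.integer(picked == "F")
print(df)
```

Restricting the matrix to the candidate columns first keeps the numeric response column out of the character conversion.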
# create dataframe
my.df <- data.frame(c1=c('f','m'),
c2=c('f','m'),
c3=c('m','f'),
c4=c('m','m'),
resp=c(2, 1))
# add column
my.df$ChooseFemale <- NA
# loop over rows
for (row in 1:nrow(my.df)){
# extract the column to check from response column
col <- paste0('c', my.df$resp[row])
# fill in new column
my.df$ChooseFemale[row] <- ifelse(my.df[row, col]=='f', 1, 0)
}
apply(df, 1, function(x) ifelse(x[as.numeric(x['response'])] == 'F', 1, 0))
[1] 1 0
Here is the basic idea: select the column using the value in response, then use apply with MARGIN=1 to apply this function to each row.
df[1,'response']
[1] 2
df[1,df[1,'response']]
[1] F
Levels: F M
data
df <- read.table(text = "
C1 C2 C3 C4 response
F F M M 2
M M F M 1
",header=T)
You can create a simple function to check whether the response number matches "F", and then apply it to each row at once.
A tidyverse approach:
library(tidyverse)
mydata <- data.frame(C1=sample(c("F","M"),10,replace = T),
C2=sample(c("F","M"),10,replace = T),
C3=sample(c("F","M"),10,replace = T),
C4=sample(c("F","M"),10,replace = T),
response=sample(c(1:4),10,replace = T),
stringsAsFactors = FALSE)
C1 C2 C3 C4 response
1 M M M M 1
2 F F F M 4
3 M F M M 2
4 F M M F 2
5 M M M F 1
6 M F M F 4
7 M M M F 3
8 M M M M 2
9 M F M M 3
10 F F M F 4
Custom function to check if the response matches "F"
female_choice <- function(C1, C2, C3, C4, response) {
c(C1, C2, C3, C4)[response] == "F"
}
And then just use mutate() to modify your dataframe, and pmap() to use its rows, one by one, as the set of arguments for female_choice()
mydata %>%
mutate(ChooseFemale = pmap_lgl(., female_choice))
C1 C2 C3 C4 response ChooseFemale
1 M M M M 1 FALSE
2 F F F M 4 FALSE
3 M F M M 2 TRUE
4 F M M F 2 FALSE
5 M M M F 1 FALSE
6 M F M F 4 TRUE
7 M M M F 3 FALSE
8 M M M M 2 FALSE
9 M F M M 3 FALSE
10 F F M F 4 TRUE
Here is one way to do it using tidyverse packages. As specified in the question, this takes into account both which candidate was chosen (C1-C4) and sex of the candidate (F/M):
# loading needed libraries
library(tidyverse)
# data
df <- utils::read.table(text = "C1 C2 C3 C4 response
F F M M 2
M M F M 1", header = TRUE) %>%
tibble::as_tibble(x = .) %>%
tibble::rowid_to_column(.)
# manipulation
dplyr::full_join(
# creating dataframe with the new chooseFemale variable
x = df %>%
tidyr::gather(
data = .,
key = "candidate",
value = "choice",
C1:C4
) %>%
dplyr::mutate(choice_new = paste("C", response, sep = "")) %>%
# creating the needed column by checking both the candidate chosen and
# the sex of the candidate
dplyr::mutate(chooseFemale = dplyr::case_when((choice_new == candidate) &
(choice == "F") ~ 1,
(choice_new == candidate) &
(choice == "M") ~ 0
)) %>%
dplyr::select(.data = ., -choice_new) %>%
tidyr::spread(data = ., key = candidate, value = choice) %>%
dplyr::filter(.data = ., !is.na(chooseFemale)) %>%
dplyr::select(.data = ., -c(C1:C4)),
# original dataframe
y = df,
by = c("rowid", "response")
) %>% # removing the redundant row id
dplyr::select(.data = ., -rowid) %>% # rearranging the columns
dplyr::select(.data = ., C1:C4, response, chooseFemale)
#> # A tibble: 2 x 6
#> C1 C2 C3 C4 response chooseFemale
#> <fct> <fct> <fct> <fct> <int> <dbl>
#> 1 F F M M 2 1
#> 2 M M F M 1 0
Created on 2018-08-24 by the reprex package (v0.2.0.9000).
I'll provide an answer in the tidyr format. Your data is in a "wide" format. This makes it very human readable, but not necessarily machine readable. The first step to making it more tidy is to convert the data to long format. In other words, let's transform the data so that we don't have to do calculations across multiple columns in a single row.
Tidy format allows you to use grouping variables, create summaries, etc.
library(dplyr)
library(tidyr)
df <- data.frame(C1 = c("F","M"),
C2 = c("F","M"),
C3 = c("M","F"),
C4 = c("M","M"),
stringsAsFactors = FALSE)
> df
C1 C2 C3 C4
1 F F M M
2 M M F M
Let's add an "id" field so we can keep track of each unique row. This is the same as the row number...but we are going to be converting the wide data to long data with different row numbers. Then use gather to convert from wide data to long data.
df_long <- df %>%
mutate(id = row_number()) %>%
gather(key = "key", value = "value",C1:C4)
> df_long
id key value
1 1 C1 F
2 2 C1 M
3 1 C2 F
4 2 C2 M
5 1 C3 M
6 2 C3 F
7 1 C4 M
8 2 C4 M
Now it is possible to use group_by() to group based on variables, perform summaries, etc.
For what you've asked you group by the id column and then perform calculations on the group. In this case we will take the sum of all values that are "F". Then we ungroup and spread back to the wide / human readable format.
df_long <- df_long %>%
group_by(id) %>%
mutate(response = sum(value=="F",na.rm=TRUE)) %>%
ungroup()
> df_long
# A tibble: 8 x 4
id key value response
<int> <chr> <chr> <int>
1 1 C1 F 2
2 2 C1 M 1
3 1 C2 F 2
4 2 C2 M 1
5 1 C3 M 2
6 2 C3 F 1
7 1 C4 M 2
8 2 C4 M 1
To get the data back in wide format once you are done doing all calculations that you need in long format:
df <- df_long %>%
spread(key,value)
> df
# A tibble: 2 x 6
id response C1 C2 C3 C4
<int> <int> <chr> <chr> <chr> <chr>
1 1 2 F F M M
2 2 1 M M F M
To get the data back in the order you had it:
df <- df %>%
select(-id) %>%
select(C1:C4,everything())
> df
# A tibble: 2 x 5
C1 C2 C3 C4 response
<chr> <chr> <chr> <chr> <int>
1 F F M M 2
2 M M F M 1
You can of course use the pipes to do this all in one step.
df <- df %>%
mutate(id = row_number()) %>%
gather(key = "key", value = "value",C1:C4) %>%
group_by(id) %>%
mutate(response = sum(value=="F",na.rm=TRUE)) %>%
ungroup() %>%
spread(key,value) %>%
select(-id) %>%
select(C1:C4,everything())

How to get the values with maximum occurrence across range of columns based on another column

I have a data frame with 29 rows and 26 columns and a lot of NAs. The data looks somewhat like the sample below (I am working in RStudio):
df <-
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d f d d na na na f
a1 b2 d d d f na f na na
a1 b3 d f d f na na d d
a2 c1 f f d na na d d f
a2 c2 f d d f na f na na
a2 c3 d f d f na na f d
Here we have columns V1 to V10. a1 and a2 are the two distinct values in column V1.
In column V2, b1-b3 are the distinct values related to a1 and c1-c3 are those related to a2.
Columns V3 to V10 hold the values for each row related to a1 and a2.
The result I want is:
NewV1 max.occurrence(V3-V10)
a1 d
a2 f
To summarize: for each value of V1, I want the value that occurs most often across columns V3 to V10 (max.occurrence(V3-V10)). Note that NAs should be excluded.
Another possibility using the data.table-package:
library(data.table)
melt(setDT(df),
id = 1:2,
na.rm = TRUE)[, .N, by = .(V1, value)
][order(-N), .(max.occ = value[1]), by = V1]
which gives:
V1 max.occ
1: a1 d
2: a2 f
A similar logic with the tidyverse-packages:
library(dplyr)
library(tidyr)
df %>%
gather(k, v, V3:V10, na.rm = TRUE) %>%
group_by(V1, v) %>%
tally() %>%
arrange(-n) %>%
slice(1) %>%
select(V1, max.occ = v)
If you like dplyr, this would work:
df %>%
gather("key", "value", V3:V10) %>%
group_by(V1) %>%
dplyr::summarise(max.occurence = names(which.max(table(value))))
This gives:
# A tibble: 2 x 2
V1 max.occurence
<fct> <chr>
1 a1 d
2 a2 f
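The same count-and-pick logic can also be sketched in base R. This assumes the "na" entries in the question's sample stand for missing values (they are read as NA via na.strings), and the name most_common is made up for this sketch.

```r
df <- read.table(text = "
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d f d d na na na f
a1 b2 d d d f na f na na
a1 b3 d f d f na na d d
a2 c1 f f d na na d d f
a2 c2 f d d f na f na na
a2 c3 d f d f na na f d
", header = TRUE, na.strings = "na", colClasses = "character")

# Split the value columns by V1, pool each group's values into one vector,
# tabulate them (table() drops NA by default) and keep the most frequent one
most_common <- sapply(
  split(df[paste0("V", 3:10)], df$V1),
  function(g) names(which.max(table(unlist(g))))
)
print(most_common)
```

Note that which.max() returns the first maximum, so ties are resolved in favour of the value that sorts first.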

collapse text by 2 ID's in a row

I have a question similar to this topic: "Collapse text by group in data frame [duplicate]"
group text
a a1
a a2
a a3
b b1
b b2
c c1
c c2
c c3
c c4
I would like to collapse each pair of consecutive rows within a group (not the whole group):
group text
a a1a2
a a2a3
b b1b2
c c1c2
c c2c3
c c3c4
Alternative tidyverse answer:
library(tidyverse)
dat %>%
group_by(group) %>%
mutate(text=paste0(lag(text),text)) %>% slice(-1)
Using data.table:
library(data.table)
setDT(dat)
dat[, paste0(shift(text,1), text)[-1], by=group]
# group V1
#1: a a1a2
#2: a a2a3
#3: b b1b2
#4: c c1c2
#5: c c2c3
#6: c c3c4
How about this:
library(tidyverse)
df %>%
group_by(group) %>%
mutate(text = c(paste0(text[1:(n()-1)],text[2:n()]),NA)) %>%
filter(!is.na(text))
or
df %>%
group_by(group) %>%
summarise(text = list(paste0(text[1:(n()-1)],text[2:n()]))) %>%
unnest(cols = text)
group text
1 a a1a2
2 a a2a3
3 b b1b2
4 c c1c2
5 c c2c3
6 c c3c4
The code above assumes the group length is always greater than one. If there are single-row groups, you'll need an if statement to treat them differently. For example, if we add a row with group="d" and text="d1" you could do this:
df %>%
group_by(group) %>%
summarise(text = if(n()==1) list(text) else list(paste0(text[1:(n()-1)],text[2:n()]))) %>%
unnest(cols = text)
group text
<chr> <chr>
1 a a1a2
2 a a2a3
3 b b1b2
4 c c1c2
5 c c2c3
6 c c3c4
7 d d1
You can try:
unlist(by(df2$text,df2$group,function(x)paste0(head(x,-1),x[-1])))
a1 a2 b c1 c2 c3
"a1a2" "a2a3" "b1b2" "c1c2" "c2c3" "c3c4"
Another base R option with split and stack
stack(lapply(split(df1$text, df1$group), function(x) paste0(x[-length(x)], x[-1])))[2:1]
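None of the answers shows the input data, so here is a self-contained base R sketch of the pairing idiom on the question's sample. The data frame name dat and the helper pair_up are assumptions made for illustration.

```r
# Sample data from the question
dat <- data.frame(
  group = c("a", "a", "a", "b", "b", "c", "c", "c", "c"),
  text  = c("a1", "a2", "a3", "b1", "b2", "c1", "c2", "c3", "c4"),
  stringsAsFactors = FALSE
)

# Paste every element to its successor: drop the last element on the left,
# the first element on the right
pair_up <- function(x) paste0(head(x, -1), tail(x, -1))

# Apply the helper within each group
pairs <- lapply(split(dat$text, dat$group), pair_up)

# Rebuild a data frame with one row per pair
res <- data.frame(
  group = rep(names(pairs), lengths(pairs)),
  text  = unlist(pairs, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(res)
```

With this helper, a group that has only one row produces no pairs and therefore drops out of the result, which differs from the if statement variant above that keeps the single value.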

How to get the top element per group with multiple columns?

I have the use-case shown below. Basically I have a data frame with three columns. I want to group by two columns (c1,c2) and sum the third one c3. Then I want to pick only the top 1 c1 with maximum c3 (among all c2) i.e. sorting would be unnecessary since I'm only interested in the max.
library(plyr)
df <- data.frame(c1=c('a','a','a','b','b','c'),c2=c('x','y','y','x','y','x'),c3=c(1,2,3,4,5,6))
df
c1 c2 c3
1 a x 1
2 a y 2
3 a y 3
4 b x 4
5 b y 5
6 c x 6
sel <- plyr::ddply(df, c('c1','c2'), plyr::summarize,c3=sum(c3))
sel[with(sel, order(c1,-c3)),]
c1 c2 c3
2 a y 5 <<< this one highest c3 for (c1,c2) combination
1 a x 1
4 b y 5 <<< this one highest c3 for (c1,c2) combination
3 b x 4
5 c x 6 <<< this one highest c3 for (c1,c2) combination
I could do this in a loop but I'm wondering how it can be done in a vector fashion or using a high-level function.
Here's a base R approach:
df2 <- aggregate(c3~c1+c2, df, sum)
subset(df2[order(-df2$c3),], !duplicated(c1))
# c1 c2 c3
#3 c x 6
#4 a y 5
#5 b y 5
Another solution from dplyr.
library(dplyr)
df2 <- df %>%
group_by(c1, c2) %>%
summarise(c3 = sum(c3)) %>%
filter(c3 == max(c3))
df2
# A tibble: 3 x 3
# Groups: c1 [3]
c1 c2 c3
<fctr> <fctr> <dbl>
1 a y 5
2 b y 5
3 c x 6
Here is another option with data.table
library(data.table)
setDT(df)[, .(c3 = sum(c3)) , .(c1, c2)][, .SD[which.max(c3)], .(c1)]
# c1 c2 c3
#1: a y 5
#2: b y 5
#3: c x 6
Using dplyr:
df %>%
group_by(c1, c2) %>%
summarise(c3 = sum(c3)) %>%
top_n(1, c3)
Or the last line can be slice(which.max(c3)), which will guarantee a single row.
