I have a data frame with some duplicated rows, and I want to merge only those duplicated rows. An example is given below:
name b c d
1 yp 3 NA NA
2 yp 3 1 NA
3 IG NA 3 NA
4 OG 4 1 0
Duplicated rows are rows that have the same name. In this example, rows 1 and 2 need to be merged, with the NA values replaced by the available numerical value. The desired result is:
name b c d
1 yp 3 1 NA
2 IG NA 3 NA
3 OG 4 1 0
Assumption: if two rows have the same name, and their corresponding columns are not NA, then the corresponding column values must be the same numerical value.
Here's a dplyr approach:
library(dplyr)
df %>% group_by(name) %>% summarise_each(funs(first(.[!is.na(.)])))
#Source: local data frame [3 x 4]
#
# name b c d
# (fctr) (int) (int) (int)
#1 IG NA 3 NA
#2 OG 4 1 0
#3 yp 3 1 NA
This groups the data by "name" and, for each unique name, returns a single row in which every other column holds the first value that is not NA, or NA if all entries are NA. This is in line with the assumption that if several numerical values are present, they must all be the same (and hence, we can pick the first one).
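For comparison, the same "first non-NA value per group" rule can be sketched in base R without any packages. This is only an illustration: the helper name first_non_na and the rebuilt example data are assumptions, not part of the answer above.

```r
# Example data rebuilt from the question (values assumed to be numeric)
df <- data.frame(name = c("yp", "yp", "IG", "OG"),
                 b = c(3, 3, NA, 4),
                 c = c(NA, 1, 3, 1),
                 d = c(NA, NA, NA, 0),
                 stringsAsFactors = FALSE)

# Hypothetical helper: first non-NA value, or NA when everything is NA
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps rows with NAs so the helper gets to see them
res <- aggregate(. ~ name, data = df, FUN = first_non_na, na.action = na.pass)
res
```

The groups come back sorted by name, as in the dplyr output.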
Perhaps you can try something like the following:
setDT(mydf)[, lapply(.SD, function(x) {
if (all(is.na(x))) NA else x[!is.na(x)][1]
}), by = name]
# name b c d
# 1: yp 3 1 NA
# 2: IG NA 3 NA
# 3: OG 4 1 0
Basically, if all values are NA, just take the first NA value; otherwise, take the first non-NA value.
As pointed out by @docendodiscimus, this can be simplified to:
setDT(mydf)[, lapply(.SD, function(x) x[!is.na(x)][1]), by = name]
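The simplification works because indexing an empty vector with [1] already yields NA, so the explicit all(is.na(x)) check is not needed. A small base R illustration (the vector names are made up for the example):

```r
all_missing <- c(NA_integer_, NA_integer_)
mixed <- c(NA, 3L, 3L)

# Dropping the NAs from an all-NA vector leaves an empty vector ...
empty <- all_missing[!is.na(all_missing)]

# ... and taking element 1 of an empty vector gives NA of the same type
first_a <- empty[1]

# With at least one real value, [1] picks the first non-NA value
first_b <- mixed[!is.na(mixed)][1]
```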
A quick way to solve this would be to use the dplyr package: group on the variables you want to join on, and then decide how to combine the rows.
A good way to join the rows could be to take the mean of all but the NA values.
In your case the code would be:
library(dplyr)
df %>% group_by(name) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
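One thing to keep in mind with the mean approach: when every value in a group is NA, mean(..., na.rm = TRUE) returns NaN rather than NA. A minimal base R sketch of this, with a hypothetical wrapper mean_or_na that keeps NA in that case:

```r
# Mean over the non-NA values works as expected
m1 <- mean(c(3, NA, 3), na.rm = TRUE)

# All-NA input: nothing is left to average, so the result is NaN
m2 <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Hypothetical wrapper that returns NA instead of NaN for all-NA input
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}
m3 <- mean_or_na(c(NA_real_, NA_real_))
```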
I have the following dataframe:
ID Parts
-- -----
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
I would like to convert the Parts column into multiple columns, splitting on the : delimiter, so that it looks like this:
ID A B X2 J4 C D G4 X6 ........
-- - - -- -- - - -- --
1 A B na na na na na na
2 na na X2 na na na na na
3 na na na J4 na na na na
4 A na na na C D G4 X6
Here I would not know the number of potential columns in advance.
I have met my match on this one: I can split on the delimiter with strsplit(), but only when the Parts column has a fixed number of entries.
You can use a combination of tidyr::separate, tidyr::pivot_longer, and tidyr::pivot_wider. First, use str_count to determine the number of columns to split Parts into, which is not the same as the number of unique values (see How it works):
library(dplyr)
library(tidyr)
library(stringr)
n_col <- max(stringr::str_count(df$Parts, ":")) + 1
df %>%
tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":") %>%
dplyr::mutate(across(everything(), ~dplyr::na_if(., ""))) %>%
tidyr::pivot_longer(-ID) %>%
dplyr::select(-name) %>%
tidyr::drop_na() %>%
tidyr::pivot_wider(id_cols = ID,
names_from = value)
ID A B X2 J4 C D G4 X6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B NA NA NA NA NA NA
2 2 NA NA X2 NA NA NA NA NA
3 3 NA NA NA J4 NA NA NA NA
4 4 A NA NA NA C D G4 X6
How it works
You do not need to know the number of unique values with this code; the pivots take care of that. What you do need to know is how many new columns Parts will be split into with separate. That's easy to do by counting the number of delimiters with str_count and adding one. This way you have the appropriate number of columns to separate Parts into by your delimiter.
This is because pivot_longer will create a two column dataframe with repeated ID and a column with the delimited values of Parts -- an ID, Parts pairing. Then when you use pivot_wider the columns are automatically created for each unique value of Parts and the value is retained within the column. This function automatically fills with NA where an ID and Parts combination is not found.
Try running this pipe by pipe to better understand if need be.
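If it helps to see the same idea without the tidyverse, here is a rough base R sketch: split each Parts string, collect the unique non-empty values as column names, and mark which values occur in each row. The variable names are illustrative only, and this is a different technique from the pivot-based answer above, not a reproduction of it.

```r
df <- data.frame(ID = 1:4,
                 Parts = c("A:B::", "X2:::", "::J4:", "A:C:D:G4:X6"),
                 stringsAsFactors = FALSE)

# Split on ":" and drop the empty strings left by consecutive delimiters
parts <- lapply(strsplit(df$Parts, ":", fixed = TRUE),
                function(p) p[nzchar(p)])

# One output column per unique value, in order of first appearance
cols <- unique(unlist(parts))

# For each row, keep the value where it occurs and NA elsewhere
wide <- t(sapply(parts, function(p) ifelse(cols %in% p, cols, NA_character_)))
colnames(wide) <- cols

res <- cbind(df["ID"], as.data.frame(wide, stringsAsFactors = FALSE))
res
```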
Data
lines <- "
ID Parts
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
"
df <- read.table(text = lines, header = T)
Could the separate function from tidyr be what you are looking for?
https://tidyr.tidyverse.org/reference/separate.html
It might require some fancy regex implementation, but could potentially work.
Instead of copying and pasting the corresponding columns in Excel, I want to fill in several columns one after another, matching on an ID column named addr.
Assume my data sets look like this:
df1 <- data.frame(addr=c('a','b','c','d'),
num = c(1,2,3,4),
x=c(1, NA,4,5));df1
df2 <- data.frame(addr=c('e','f','g'),
num=c(100,200,500));df2
var<-intersect(names(df1), names(df2));var
combined.df<-merge(x = df1, y = df2, by = var, all=T);combined.df
df3 <- data.frame(addr=c('e','f','g'),
x=c(5,7,NA));df3
var<-intersect(names(df3), names(combined.df));var
combined.df<-merge(x = combined.df, y = df3, by = var, all=T);combined.df
The current output is
addr x num
1 a 1 1
2 b NA 2
3 c 4 3
4 d 5 4
5 e 5 NA
6 e NA 100
7 f 7 NA
8 f NA 200
9 g NA 500
The desired output is
addr x num
1 a 1 1
2 b NA 2
3 c 4 3
4 d 5 4
5 e 5 100
6 f 7 200
7 g NA 500
That is, fill in the empty cells without deleting cells that already hold a value.
Any advice will be greatly appreciated
To automate this with a for loop, place all datasets except the first one in a list and create a copy of the first dataset as 'out'. Then loop over the sequence of the list, merge 'out' with the corresponding list element, specify by as the intersect of the names of both datasets, and update 'out' by assigning (<-) the result back to it.
out <- df1
lst1 <- list(df2, df3)
for(i in seq_along(lst1)) {
out <- merge(out, lst1[[i]],
by = intersect(names(out), names(lst1[[i]])), all = TRUE)
}
Then we group the output by 'addr' and summarise across all other columns, keeping the non-NA element if one exists and returning NA otherwise.
library(dplyr)
out %>%
group_by(addr) %>%
summarise(across(everything(),
~ if(all(is.na(.))) NA_real_ else .[!is.na(.)]), .groups = 'drop')
-output
# addr x num
# <chr> <dbl> <dbl>
#1 a 1 1
#2 b NA 2
#3 c 4 3
#4 d 5 4
#5 e 5 100
#6 f 7 200
#7 g NA 500
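The merge loop and the collapse step can also be sketched in base R alone, using Reduce for the repeated merge and aggregate for the per-addr collapse. This is only a sketch under the same assumption as above (at most one non-NA value per addr and column); the helper names merge_shared and first_non_na are made up for the example.

```r
df1 <- data.frame(addr = c("a", "b", "c", "d"), num = c(1, 2, 3, 4),
                  x = c(1, NA, 4, 5), stringsAsFactors = FALSE)
df2 <- data.frame(addr = c("e", "f", "g"), num = c(100, 200, 500),
                  stringsAsFactors = FALSE)
df3 <- data.frame(addr = c("e", "f", "g"), x = c(5, 7, NA),
                  stringsAsFactors = FALSE)

# Full outer merge on whatever columns the two data frames share
merge_shared <- function(a, b) {
  merge(a, b, by = intersect(names(a), names(b)), all = TRUE)
}
out <- Reduce(merge_shared, list(df1, df2, df3))

# Collapse duplicated addr rows: first non-NA value per column, else NA
first_non_na <- function(v) v[!is.na(v)][1]
res <- aggregate(. ~ addr, data = out, FUN = first_non_na, na.action = na.pass)
res
```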
I'm trying to pivot_longer 34 columns of a data set with about 10,000 rows in R. The data was collected via survey, and each column represents a possible answer to a question. I want to pivot_longer one of the questions, which had 34 possible answers and accounts for 34 of the 107 columns. For each respondent, the column for the selected answer has a value (1) and the other 33 columns are NA.
Example subset of data frame for a question with 5 possible answers (df):
ID A B C D E
1 1 NA NA NA NA
2 NA 1 NA NA NA
3 NA NA NA NA 1
4 NA NA NA NA NA
5 NA 1 NA NA NA
I need to get to:
ID Answer
1 A
2 B
3 E
4 NA
5 B
I want to pivot_longer the results to this question, while maintaining all the other columns. The issue occurs because some people didn't answer this question, resulting in all NA's (See row 4).
I'm using the code:
dfNew <- pivot_longer(df, c(A,B,C,D,E), names_to = "Answer", values_drop_na = TRUE)
dfNew
ID Answer
1 A
2 B
3 E
5 B
Which removes ID 4 from the data. Not using values_drop_na results in having a row for every NA value in A:E. How do I get it to maintain ID 4 as part of the data set, and make the value for Answer NA?
You can use complete to fill in the missing values:
library(tidyr)
pivot_longer(df, A:E, names_to = "Answer", values_drop_na = TRUE) %>%
complete(ID = unique(df$ID)) %>%
dplyr::select(-value)
# A tibble: 5 x 2
# ID Answer
# <int> <chr>
#1 1 A
#2 2 B
#3 3 E
#4 4 NA
#5 5 B
You can also use max.col here :
cbind(df[1], answer = names(df)[-1][max.col(!is.na(df[-1])) *
NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)])
This might be quite difficult to understand.
1. max.col(!is.na(df[-1])) returns the column index of the non-NA value in each row, but if the row is all NA it returns an arbitrary index.
2. NA^ !rowSums(!is.na(df[-1])) returns NA for rows that are all NA and 1 for rows that have at least one non-NA value.
3. When we multiply the results of steps 1 and 2, we get NA for the all-NA rows and the column index for rows where there is a value.
max.col(!is.na(df[-1])) * NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)
#[1] 1 2 5 NA 2
4. We use these (above) values to subset the column names of df to get the answer.
names(df[-1])[max.col(!is.na(df[-1]))*NA^!rowSums(!is.na(df[-1]), na.rm = TRUE)]
#[1] "A" "B" "E" NA "B"
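If the NA^ trick feels too terse, the same steps can be written out more explicitly. The sketch below rebuilds the example data from the question and sets the index to NA for unanswered rows in a separate step; ties.method = "first" is a choice made here for reproducibility, not something the answer above uses.

```r
df <- data.frame(ID = 1:5,
                 A = c(1, NA, NA, NA, NA),
                 B = c(NA, 1, NA, NA, 1),
                 C = NA,
                 D = NA,
                 E = c(NA, NA, 1, NA, NA))

# TRUE where an answer was selected
answered <- !is.na(df[-1])

# Column index of the selected answer in each row
idx <- max.col(answered, ties.method = "first")

# Rows without any answer get NA instead of an arbitrary index
idx[rowSums(answered) == 0] <- NA

res <- data.frame(ID = df$ID, Answer = names(df)[-1][idx],
                  stringsAsFactors = FALSE)
res
```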
I have data like the one in the picture, where there are two columns (Cday, Dday) with some missing values.
There can't be a row with values in both columns; a row has a value in one column, in the other, or in neither.
I want to create a column "new" that copies the value from whichever column holds a number.
Really appreciate any help!
Since no row has a value for both, you can just sum up the two existing columns. Assume your dataframe is called df.
df$'new' = rowSums(df[,2:3], na.rm=T)
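This sums each row while ignoring NAs, which should give you what you want for rows that have a value. Be aware that a row where both columns are NA gets 0 rather than NA. (Note: you may need to adjust the column numbering if you have more columns than what you've shown.)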
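Because rowSums with na.rm = TRUE returns 0 for a row in which every value is NA, such rows may need to be reset to NA afterwards. A small sketch with made-up data, referring to the columns by name instead of by position:

```r
df <- data.frame(id = 1:4,
                 Cday = c(1, NA, NA, 2),
                 Dday = c(NA, NA, 3, NA))

# Sum across the two columns, ignoring NAs (row 2 becomes 0 here)
df$new <- rowSums(df[, c("Cday", "Dday")], na.rm = TRUE)

# Restore NA where neither column had a value
df$new[is.na(df$Cday) & is.na(df$Dday)] <- NA
df
```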
The dplyr package has the coalesce function.
library(dplyr)
df <- data.frame(id=1:8, Cday=c(1,2,NA,NA,3,NA,2,NA), Dday=c(NA,NA,NA,3,NA,2,NA,1))
new <- df %>% mutate(new = coalesce(Dday, Cday))
new
# id Cday Dday new
#1 1 1 NA 1
#2 2 2 NA 2
#3 3 NA NA NA
#4 4 NA 3 3
#5 5 3 NA 3
#6 6 NA 2 2
#7 7 2 NA 2
#8 8 NA 1 1
I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace NAs in column v1 with the previous non-NA value if the next non-NA value matches it. That being said, I want the result to look like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you can see, rows 2 and 3 are replaced with the number 1 because rows 1 and 4 hold an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking a lot but so far no luck. Thanks
Here is an idea using zoo package. We basically fill NAs in both directions and set NA the values that are not equal between those directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df1 %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I was able to do this with the na.locf function from the zoo package. First, I use na.locf with its defaults to replace each NA with the most recent previous non-NA value and store the result in one column. Then I call the same function with fromLast = TRUE, so that each NA is replaced with the next non-NA value, and store that in another column. Finally, I compare the two columns and, wherever the values in a row do not match, I replace them with NA.
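To make the two passes visible, here is a base R sketch of the same idea on the example vector. It does not use zoo; fill_forward is a hypothetical stand-in for na.locf, written only for this illustration.

```r
v1 <- c(1, NA, NA, 1, NA, NA, 3)

# Hypothetical helper: carry the last non-NA value forward
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))          # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]      # leading NAs (idx == 0) stay NA
}

fwd <- fill_forward(v1)             # filled from the previous non-NA value
bwd <- rev(fill_forward(rev(v1)))   # filled from the next non-NA value

# Keep a filled value only where both directions agree
res <- ifelse(fwd == bwd, fwd, NA)
res
```

On this input the forward fill gives 1 1 1 1 1 1 3 and the backward fill gives 1 1 1 1 3 3 3, so only rows 5 and 6 disagree and stay NA.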