Convert a single column into multiple columns based on delimiter in R - r

I have the following dataframe:
ID Parts
-- -----
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
And I would like to convert the Parts column into multiple columns by the : delimiter, so it should look like:
ID A B X2 J4 C D G4 X6 ........
-- - - -- -- - - -- --
1 A B na na na na na na
2 na na X2 na na na na na
3 na na na J4 na na na na
4 A na na na C D G4 X6
where I would not know the number of potential columns in advance.
I have met my match on this one - strsplit() by delim I can do but only with fixed number of entities in the Parts column

You can use a combination of tidyr::separate, tidyr::pivot_longer, and tidyr::pivot_wider. First, use stringr::str_count to determine the number of columns to split Parts into. This is the largest number of pieces in any row, not the number of unique values (see How it works):
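library(dplyr)
library(tidyr)
library(stringr)
# Number of pieces in the longest Parts string: delimiters + 1
n_col <- max(stringr::str_count(df$Parts, ":")) + 1
df %>%
  # Split Parts into col1..coln; shorter rows are padded with NA on the right
  tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":", fill = "right") %>%
  # Turn empty strings into NA (ID is excluded because it is not character)
  dplyr::mutate(across(-ID, ~dplyr::na_if(., ""))) %>%
  tidyr::pivot_longer(-ID) %>%
  dplyr::select(-name) %>%
  tidyr::drop_na() %>%
  tidyr::pivot_wider(id_cols = ID,
                     names_from = value)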
ID A B X2 J4 C D G4 X6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B NA NA NA NA NA NA
2 2 NA NA X2 NA NA NA NA NA
3 3 NA NA NA J4 NA NA NA NA
4 4 A NA NA NA C D G4 X6
How it works
You do not need to know the number of unique values with this code, because the pivots take care of that. What you do need to know is how many new columns Parts will be split into with separate. That's easy to do by counting the number of delimiters with str_count and adding one. This way you have the appropriate number of columns to separate Parts into by your delimiter.
The pivots handle the rest: pivot_longer creates a long dataframe with repeated ID values and a column holding the delimited values of Parts, i.e. one row per ID, Parts pairing. Then when you use pivot_wider, a column is automatically created for each unique value of Parts and the value is retained within that column. This function automatically fills with NA where an ID and Parts combination is not found.
Try running this pipe by pipe to better understand if need be.
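Running the pipeline one step at a time might look like the sketch below. It rebuilds the example data inline, and the intermediate names (pieces, long, wide) are made up here for illustration. values_from is spelled out explicitly, whereas the call above relies on the default.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question, rebuilt inline
df <- data.frame(
  ID = 1:4,
  Parts = c("A:B::", "X2:::", "::J4:", "A:C:D:G4:X6"),
  stringsAsFactors = FALSE
)

# Step 1: the longest row has 4 delimiters, so 5 pieces
n_col <- max(str_count(df$Parts, ":")) + 1

# Step 2: split into col1..col5 (short rows are padded with NA on the right)
pieces <- separate(df, Parts, into = paste0("col", seq_len(n_col)),
                   sep = ":", fill = "right")

# Step 3: one row per (ID, value) pair, dropping empty and missing pieces
long <- pieces %>%
  pivot_longer(-ID) %>%
  filter(!is.na(value), value != "") %>%
  select(-name)

# Step 4: one column per unique value; absent combinations become NA
wide <- pivot_wider(long, id_cols = ID, names_from = value, values_from = value)

print(long)
print(wide)
```

Printing long before the final step shows the ID, value pairs that pivot_wider spreads into columns.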
Data
lines <- "
ID Parts
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
"
df <- read.table(text = lines, header = T)

Could the separate function from tidyr be what you are looking for?
https://tidyr.tidyverse.org/reference/separate.html
It might require some fancy regex implementation, but could potentially work.

Related

How to replace values in dataframe based on a second dataframe in R?

I have a dataframe df1 with multiple columns, each column representing a species name (sp1, sp2, sp3, ...).
df1
sp1 sp2 sp3 sp4
NA NA r1 r1
NA NA 1 3
NA 5 NA NA
m4 NA NA m2
I would like to replace each value in df1 with values based on a second dataframe, df2. Here, the values in df1 should be matched against df2$scale_nr and replaced by df2$percentage, so that the values in df1 end up being the corresponding $percentage values from df2.
df2
scale_nr percentage
r1 1
p1 1
a1 1
m1 1
r2 2
p2 2
a2 2
m2 2
1 10
2 20
3 30
4 40
...
Then after replacement df1 should look like
df1
sp1 sp2 sp3 sp4
NA NA 1 1
NA NA 10 30
NA 50 NA NA
4 NA NA 2
I tried this:
df2$percentage[match(df1$sp1, df2$scale_nr)] # this one works for one column
which works for one column, I know I should be able to do this over all columns easily, but somehow I can't figure it out.
I know I could do it by 'hand', like
df[df == 'Old Value'] <- 'New value'
but this seems highly inefficient because I have 40 different values that need to be replaced.
Can someone please help me with a solution for this?
You can use lapply on the frame to iterate the same thing over multiple columns.
df1[] <- lapply(df1, function(z) df2$percentage[match(z, df2$scale_nr)])
df1
# sp1 sp2 sp3 sp4
# 1 NA NA 1 1
# 2 NA NA 10 30
# 3 NA NA NA NA
# 4 NA NA NA 2
The missing values are likely because of the truncated df2 in the sample data.
If you want the option to preserve the previous value if not found in df2, then you can modify that slightly:
df1[] <- lapply(df1, function(z) {
  newval <- df2$percentage[match(z, df2$scale_nr)]
  ifelse(is.na(newval), z, newval)
})
df1
# sp1 sp2 sp3 sp4
# 1 <NA> NA 1 1
# 2 <NA> NA 10 30
# 3 <NA> 5 <NA> <NA>
# 4 m4 NA <NA> 2
FYI, the reassignment into df1[] <- is important, in contrast with df1 <-. The difference is that lapply is going to return a list, so if you use df1 <- then df1 will no longer be a data.frame. Using df1[] <-, you are telling it to replace the contents of the columns without changing the overall class of df1.
If you need to do this on only a subset of columns, that's easy:
df1[1:3] <- lapply(df1[1:3], ...)
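The difference between assigning into df1[] and assigning to df1 can be checked directly. The sketch below uses a tiny made-up pair of data frames (not the question's full data), and the names as_list and kept are chosen here for illustration.

```r
# Small illustrative data (values are assumptions, not the question's data)
df1 <- data.frame(sp1 = c(NA, "m2"), sp2 = c("r1", "1"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(scale_nr = c("r1", "m2", "1"),
                  percentage = c(1, 2, 10),
                  stringsAsFactors = FALSE)

# lapply over a data.frame returns a plain list of columns
as_list <- lapply(df1, function(z) df2$percentage[match(z, df2$scale_nr)])
class(as_list)   # a list, no longer a data.frame

# Assigning into `[]` replaces the column contents but keeps the class
kept <- df1
kept[] <- as_list
class(kept)      # still a data.frame
print(kept)
```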

Accounting for NA using Pivot_longer in R

I'm trying to pivot_longer 34 columns of a data set with about 10,000 rows in R. The data was collected via survey, and each column represents a possible answer to a question. I want to pivot_longer one of the questions, which had 34 possible answers and accounts for 34 of the 107 columns. A column has a value (1) if that answer was selected, and the other 33 columns have NA.
Example subset of data frame for a question with 5 possible answers (df):
ID A B C D E
1 1 NA NA NA NA
2 NA 1 NA NA NA
3 NA NA NA NA 1
4 NA NA NA NA NA
5 NA 1 NA NA NA
I need to get to:
ID Answer
1 A
2 B
3 E
4 NA
5 B
I want to pivot_longer the results to this question, while maintaining all the other columns. The issue occurs because some people didn't answer this question, resulting in all NA's (See row 4).
I'm using the code:
dfNew <- pivot_longer(df, c(A,B,C,D,E), names_to = "Answer", values_drop_na = TRUE)
dfNew
ID Answer
1 A
2 B
3 E
5 B
Which removes ID 4 from the data. Not using values_drop_na results in having a row for every NA value in A:E. How do I get it to maintain ID 4 as part of the data set, and make the value for Answer NA?
You can use complete to fill the missing values :
library(tidyr)
pivot_longer(df, A:E, names_to = "Answer", values_drop_na = TRUE) %>%
complete(ID = unique(df$ID)) %>%
dplyr::select(-value)
# A tibble: 5 x 2
# ID Answer
# <int> <chr>
#1 1 A
#2 2 B
#3 3 E
#4 4 NA
#5 5 B
You can also use max.col here :
cbind(df[1], answer = names(df)[-1][max.col(!is.na(df[-1])) *
NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)])
This might be quite difficult to understand, so step by step:
1. max.col(!is.na(df[-1])) returns the column index of the non-NA value in each row, but if the row is all NA it returns an arbitrary index (ties are broken at random by default).
2. NA^ !rowSums(!is.na(df[-1])) returns NA for rows that are all NA and 1 for rows that have at least one non-NA value.
3. Multiplying the results of steps 1 and 2 gives NA for the all-NA rows and the column index for rows where there is a value.
max.col(!is.na(df[-1])) * NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)
#[1] 1 2 5 NA 2
4. We use these (above) values to subset the column names of df to get the answer.
names(df[-1])[max.col(!is.na(df[-1]))*NA^!rowSums(!is.na(df[-1]), na.rm = TRUE)]
#[1] "A" "B" "E" NA "B"
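If the NA^ trick feels too terse, the same steps can be written more explicitly. This is only a sketch of an equivalent formulation: the data frame is rebuilt from the question's example, the helper names answered and idx are made up here, and ties.method = "first" is used instead of the default random tie-breaking.

```r
# Example data from the question, rebuilt inline
df <- data.frame(
  ID = 1:5,
  A = c(1, NA, NA, NA, NA),
  B = c(NA, 1, NA, NA, 1),
  C = c(NA, NA, NA, NA, NA),
  D = c(NA, NA, NA, NA, NA),
  E = c(NA, NA, 1, NA, NA)
)

# TRUE where an answer was selected
answered <- !is.na(df[-1])

# Column index of the first selected answer in each row
idx <- max.col(answered, ties.method = "first")

# Rows with no answer at all get NA instead of an arbitrary index
idx[rowSums(answered) == 0] <- NA

result <- data.frame(ID = df$ID, Answer = names(df)[-1][idx],
                     stringsAsFactors = FALSE)
print(result)
```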

copy values from different columns based on conditions (r code)

I have data like one in the picture where there are two columns (Cday,Dday) with some missing values.
There can't be a row where there are values for both columns; there's a value on either one column or the other or in neither.
I want to create the column "new" that has copied values from whichever column there was a number.
Really appreciate any help!
Since no row has a value for both, you can just sum up the two existing columns. Assume your dataframe is called df.
df$'new' = rowSums(df[,2:3], na.rm=T)
This will sum the rows, removing NAs and should give you what you want. (Note: you may need to adjust column numbering if you have more columns than what you've shown).
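One caveat with the rowSums approach: a row where both columns are NA sums to 0 rather than NA. If such rows should stay NA, a guard along these lines could be added. This is a sketch on made-up data, and both_missing and summed are names chosen here.

```r
# Made-up data with one row (the third) missing in both columns
df <- data.frame(id = 1:4,
                 Cday = c(1, NA, NA, 2),
                 Dday = c(NA, 3, NA, NA))

# Plain rowSums turns the all-NA row into 0
summed <- rowSums(df[, c("Cday", "Dday")], na.rm = TRUE)

# Restore NA where neither column had a value
both_missing <- is.na(df$Cday) & is.na(df$Dday)
df$new <- ifelse(both_missing, NA, summed)
print(df)
```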
The dplyr package has the coalesce function.
library(dplyr)
df <- data.frame(id=1:8, Cday=c(1,2,NA,NA,3,NA,2,NA), Dday=c(NA,NA,NA,3,NA,2,NA,1))
new <- df %>% mutate(new = coalesce(Cday, Dday)) # coalesce has no na.rm argument; it takes the first non-NA value
new
# id Cday Dday new
#1 1 1 NA 1
#2 2 2 NA 2
#3 3 NA NA NA
#4 4 NA 3 3
#5 5 3 NA 3
#6 6 NA 2 2
#7 7 2 NA 2
#8 8 NA 1 1

Tidying in R: how to collapse my binary columns into characters, based on vectors?

I am tidying my data in R, and want to turn multiple columns into 1, using a function iterating over the items of a vector. I was wondering whether you could help me out to:
work away a semantic error,
and make my code more efficient?
My data is based on a survey with 32 questions. Each question has multiple answers. Each answer is a column, with options 1 and NA.
For one question, a section of the dataset can be reproduced as follows:
XV2_1 <- c(1,NA,NA,NA)
XV2_2 <- c(NA,1,NA,NA)
XV2_3 <- c(NA,NA,NA,1)
XV2_4 <- c(NA,NA,1,NA)
id <- c(12,13,14,15)
dat <- data.frame(id,XV2_1, XV2_2, XV2_3,XV2_4)
> dat
id XV2_1 XV2_2 XV2_3 XV2_4
1 12 1 NA NA NA
2 13 NA 1 NA NA
3 14 NA NA NA 1
4 15 NA NA 1 NA
This is the data I would like to have:
question_2_answers <- c("Yellow","Blue","Green","Orange") #this is a vector based on the answers of the questionnaire
collapsed <- c("Yellow","Blue","Orange","Green")
collapsed_dataframe <- data.frame(id,collapsed)
>collapsed_dataframe
id X2
1 12 Yellow
2 13 Blue
3 14 Green
4 15 Orange
So far, I tried a sequence of "ifelse's" combined with mutate:
library(tidyverse)
question_2_answers <- c("Yellow","Blue","Green","Orange") #this is a vector based on the answers of the questionnaire
dat %>%
mutate(
Colour = tidy_Q2(question_2_answers,XV2_1,XV2_2,XV2_3,XV2_4)
)
tidy_Q2 <- function(a,b,c,d,e) {
ifelse(b == 1, a[1],ifelse(
c==1,a[2],ifelse(
d==1,a[3],a[4])))
}
However, my output is not as expected:
id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12 1 NA NA NA Yellow
2 13 NA 1 NA NA <NA>
3 14 NA NA NA 1 <NA>
4 15 NA NA 1 NA <NA>
I would have liked it to be as follows:
id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12 1 NA NA NA Yellow
2 13 NA 1 NA NA Blue
3 14 NA NA NA 1 Green
4 15 NA NA 1 NA Orange
Does anyone know a way to remove the error?
Another question that I'd like to ask, is whether my code can be more efficient? I have 32 survey_questions in store after this, I'd like to automate the process as much as possible. Notable things to take in mind:
not all survey questions have the same amount of options (i.e. question 2 has 2 options and therefore 2 columns, whilst question 10 has 8 options and 8 columns)
some values are strings, instead of 1 or NA
Always happy to learn,
Best,
Maria
This is a kind of wide-to-long conversion which we can do with tidyr::gather:
First, we make the colors the column names of the appropriate rows:
# Replace column names (except for the `id` column) with color values
colnames(dat)[-1] <- c("Yellow","Blue","Orange","Green")
dat
id Yellow Blue Orange Green
1 12 1 NA NA NA
2 13 NA 1 NA NA
3 14 NA NA NA 1
4 15 NA NA 1 NA
Then, we gather the non-id columns and drop the NA values:
library(tidyverse)
dat %>%
  gather(X2, val, -id) %>%   # Gather color cols from wide to long format
  filter(!is.na(val)) %>%    # Drop rows with NA values
  select(-val)               # Remove the unnecessary `val` column
id X2
1 12 Yellow
2 13 Blue
3 15 Orange
4 14 Green
This will work with any number of columns (you just need to specify all columns you don't want to gather) and keeps rows with non-NA values. If you want other conditions to exclude a row (for example, if 0 or 'unknown' should count as a non-answer, or only 'correct' counts as an answer) then you should add those conditions to the filter statement.
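gather() has since been superseded by pivot_longer() in tidyr. A minimal sketch of the same idea with pivot_longer follows. It keeps the original column names and maps them to colours through a named lookup vector; lookup and the column name column are names chosen here, and the colour order is the one used in the renaming step above.

```r
library(dplyr)
library(tidyr)

# Data from the question
dat <- data.frame(
  id = c(12, 13, 14, 15),
  XV2_1 = c(1, NA, NA, NA),
  XV2_2 = c(NA, 1, NA, NA),
  XV2_3 = c(NA, NA, NA, 1),
  XV2_4 = c(NA, NA, 1, NA)
)

# Hypothetical lookup: answer column -> label
lookup <- c(XV2_1 = "Yellow", XV2_2 = "Blue", XV2_3 = "Orange", XV2_4 = "Green")

collapsed <- dat %>%
  pivot_longer(-id, names_to = "column", values_to = "val",
               values_drop_na = TRUE) %>%   # keep only the selected answers
  mutate(X2 = unname(lookup[column])) %>%   # translate column name to label
  select(id, X2)

print(collapsed)
```

For the other survey questions, only the lookup vector and the selected columns would need to change.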
One option in base R is max.col: find the column index of the non-NA value in each row, use that index to get the corresponding column name, and create a two-column data.frame by cbinding the result with the first column.
i1 <- max.col(!is.na(dat[-1]), 'first')
cbind(dat['id'], Colour = names(dat)[-1][i1])
# id Colour
#1 12 Yellow
#2 13 Blue
#3 14 Green
#4 15 Orange
data
dat <- structure(list(id = c(12, 13, 14, 15), Yellow = c(1, NA, NA,
NA), Blue = c(NA, 1, NA, NA), Orange = c(NA, NA, NA, 1), Green = c(NA,
NA, 1, NA)), class = "data.frame", row.names = c(NA, -4L))
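As for the semantic error in the question's tidy_Q2: NA == 1 evaluates to NA, and ifelse() returns NA wherever its test is NA, so every row whose first answer column is NA ends up as NA. One possible fix, sketched below, is to test with %in%, which returns FALSE instead of NA for missing values. The argument names are changed here for readability, and the data and answer vector are taken as defined in the question (so id 14, which has its 1 in XV2_4, maps to the fourth answer).

```r
# Data and answer vector as defined in the question
dat <- data.frame(
  id = c(12, 13, 14, 15),
  XV2_1 = c(1, NA, NA, NA),
  XV2_2 = c(NA, 1, NA, NA),
  XV2_3 = c(NA, NA, NA, 1),
  XV2_4 = c(NA, NA, 1, NA)
)
question_2_answers <- c("Yellow", "Blue", "Green", "Orange")

# `x %in% 1` is FALSE (not NA) when x is NA, so the chain keeps going
tidy_Q2 <- function(answers, x1, x2, x3, x4) {
  ifelse(x1 %in% 1, answers[1],
  ifelse(x2 %in% 1, answers[2],
  ifelse(x3 %in% 1, answers[3],
  ifelse(x4 %in% 1, answers[4], NA))))
}

dat$Colour <- tidy_Q2(question_2_answers,
                      dat$XV2_1, dat$XV2_2, dat$XV2_3, dat$XV2_4)
print(dat)
```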

R: how to join the duplicate rows in one dataframe

I have one dataframe with some duplicated rows, which I want to join only duplicated rows. Given an example below:
name b c d
1 yp 3 NA NA
2 yp 3 1 NA
3 IG NA 3 NA
4 OG 4 1 0
the duplicated rows are defined by the rows which have the same name. Thus in this example, row 1 and row 2 need to be join somehow, with the NA values replaced by possible numerical value.
name b c d
1 yp 3 1 NA
2 IG NA 3 NA
3 OG 4 1 0
Assumption: if two rows have the same name, and their corresponding columns are not NA, then the corresponding column values must be the same numerical value.
Here's a dplyr approach:
library(dplyr)
df %>% group_by(name) %>% summarise_each(funs(first(.[!is.na(.)])))
#Source: local data frame [3 x 4]
#
# name b c d
# (fctr) (int) (int) (int)
#1 IG NA 3 NA
#2 OG 4 1 0
#3 yp 3 1 NA
This groups the data by "name" and for each unique name, returns a single row and in each of the other columns returns the first value that is not NA or, NA if all entries are NAs. This is in line with the assumption that if several numerical values are present, they must all be the same (and hence, we can pick the first one).
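summarise_each() and funs() are deprecated in more recent dplyr releases. A sketch of the same logic with summarise() and across() is shown below, assuming a reasonably current dplyr; the data frame is rebuilt from the question's example and merged is a name chosen here.

```r
library(dplyr)

# Example data from the question, rebuilt inline
df <- data.frame(
  name = c("yp", "yp", "IG", "OG"),
  b = c(3, 3, NA, 4),
  c = c(NA, 1, 3, 1),
  d = c(NA, NA, NA, 0),
  stringsAsFactors = FALSE
)

merged <- df %>%
  group_by(name) %>%
  # First non-NA value per column; indexing an empty vector with [1] gives NA
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

print(merged)
```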
Perhaps you can try something like the following:
library(data.table)
setDT(mydf)[, lapply(.SD, function(x) {
  if (all(is.na(x))) NA else x[!is.na(x)][1]
}), by = name]
# name b c d
# 1: yp 3 1 NA
# 2: IG NA 3 NA
# 3: OG 4 1 0
Basically, if all values are NA, just take the first NA value, or else, take the first non-NA value.
As pointed out by @docendodiscimus, this can be simplified to:
setDT(mydf)[, lapply(.SD, function(x) x[!is.na(x)][1]), by = name]
A quick way to solve this would be to use the dplyr package: group on the variables you want to join on, and then decide how to combine the rows within each group.
A good way to combine the rows could be to take the mean of all but the NA values. Note that a column that is entirely NA within a group yields NaN rather than NA.
In your case the code would be:
library(dplyr)
df %>% group_by(name) %>%
  summarise_each(funs(mean(., na.rm = TRUE)))
