Wrap new row based on condition from column [duplicate] - r

I have a dataset that has following format:
id1 a1 b2 x1;x2;x3
id2 a2 b3 x4;x5
id3 a4 b5 x6
id4 a7 b7 x7;x8
The first three columns (id, a, b) each hold a single value, but the last column can hold several values separated by ;. How can I "wrap" these into new rows, such as:
id1 a1 b2 x1
id1 a1 b2 x2
id1 a1 b2 x3
id2 a2 b3 x4
id2 a2 b3 x5
id3 a4 b5 x6
id4 a7 b7 x7
id4 a7 b7 x8

We can use cSplit from the splitstackshape package, with direction "long" so that each split value gets its own row:
library(splitstackshape)
cSplit(df1, "v4", ";", "long")
# v1 v2 v3 v4
#1: id1 a1 b2 x1
#2: id1 a1 b2 x2
#3: id1 a1 b2 x3
#4: id2 a2 b3 x4
#5: id2 a2 b3 x5
#6: id3 a4 b5 x6
#7: id4 a7 b7 x7
#8: id4 a7 b7 x8
data
df1 <- structure(list(v1 = c("id1", "id2", "id3", "id4"), v2 = c("a1",
"a2", "a4", "a7"), v3 = c("b2", "b3", "b5", "b7"), v4 = c("x1;x2;x3",
"x4;x5", "x6", "x7;x8")), .Names = c("v1", "v2", "v3", "v4"),
class = "data.frame", row.names = c(NA, -4L))
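If installing splitstackshape is not an option, the same reshaping can be sketched in base R. This is a minimal sketch under the assumption that the delimiter is a literal ";" and that no cell in v4 is empty; the names parts and long are illustrative only.

```r
# Base R sketch: split the delimited column and repeat the other columns to match.
df1 <- data.frame(v1 = c("id1", "id2", "id3", "id4"),
                  v2 = c("a1", "a2", "a4", "a7"),
                  v3 = c("b2", "b3", "b5", "b7"),
                  v4 = c("x1;x2;x3", "x4;x5", "x6", "x7;x8"),
                  stringsAsFactors = FALSE)

# One character vector of pieces per input row
parts <- strsplit(df1$v4, ";", fixed = TRUE)

# Repeat each row index once per piece, then attach the pieces as the new v4
long <- df1[rep(seq_len(nrow(df1)), lengths(parts)), c("v1", "v2", "v3")]
long$v4 <- unlist(parts)
rownames(long) <- NULL
long
```

The key step is rep(seq_len(nrow(df1)), lengths(parts)), which duplicates each row index as many times as its v4 cell has pieces.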


How do I add 5 blank columns between every other filled column?

I have a dataframe containing N=11 variables. The X's are column names defined by R.
filenames=gtools::mixedsort(c("a1", "a10", "a11", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"))
filenames<-data.frame(t(filenames))
list(filenames)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
How do I insert 6 blank columns before the first filled column (a1.csv), and subsequently 5 blank columns between each of the filled columns? The output should look something like:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18.... X67 X68 X69 X70 X71 X72
1 a1 a2 .... a11
I need a way that can automatically perform the same operation of adding blank columns whatever the value of N.
A bit primitive but this should do what you want:
library(tidyverse)
library(stringr)
library(gtools)
filenames=gtools::mixedsort(c("a1", "a10", "a11", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"))
filenames <- data.frame(t(filenames))
# Note: I am not putting filenames into a list; I use it in the form of a data frame
empty_col <- data.frame(matrix(ncol = 5, nrow = nrow(filenames), data = ""))
for(i in seq_along(filenames)){
  filenames <- filenames %>%
    add_column(empty_col, .before=(i + (i-1)*ncol(empty_col))) # was .after
}
filenames <- cbind(filenames, empty_col) # this adds 5 columns at the end since .before and .after can't be used in the same function
names(filenames) <- ifelse(str_starts(names(filenames), 'X'), "", names(filenames))
Result:
1 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
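For an arbitrary N, the target layout can also be built without a loop by computing where each filled column lands. The following is a base R sketch, not the answer's method: pad_columns and its arguments lead, gap, trail and fill are hypothetical names, and trail = 5 is an assumption inferred from the X72 in the expected output (6 + 11 + 10*5 + 5 = 72).

```r
# Hypothetical helper: place the columns of df into a wider, blank-padded data frame.
pad_columns <- function(df, lead = 6, gap = 5, trail = 5, fill = "") {
  n <- ncol(df)
  stopifnot(n >= 1)
  # Target position of each original column: after `lead` blanks,
  # then one filled column every (gap + 1) positions
  pos <- lead + 1 + (seq_len(n) - 1) * (gap + 1)
  total <- max(pos) + trail
  # Start from an all-blank frame and drop the filled columns into place
  out <- as.data.frame(matrix(fill, nrow = nrow(df), ncol = total),
                       stringsAsFactors = FALSE)
  out[pos] <- lapply(df, as.character)
  out
}

filenames <- data.frame(t(paste0("a", 1:11)), stringsAsFactors = FALSE)
res <- pad_columns(filenames)
ncol(res)
```

Because the positions are derived from ncol(df), the same call works for any number of input columns.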

in R4.1.2 How to remove duplicate cells in a row leaving only the first cell

How to remove a repeated duplicate cell in row, leaving only the first cell.
(Remove the 2nd A3)
V1 V2 V3
A1 NA C1
A2 NA C2
A3 A3 C3
A4 NA C4
A5 NA C5
A6 NA C6
A7 NA C7
A8 NA C8
my target
V1 V2 V3
A1 NA C1
A2 NA C2
A3 NA C3
A4 NA C4
A5 NA C5
A6 NA C6
A7 NA C7
A8 NA C8
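for(x in seq_len(nrow(dataset)))  # iterate over every row index, not just the last one
{
  if(dataset[x, 2] %in% dataset[, 1])
    dataset[x, 2] <- NA
}
Something like this, I guess.
It should work even if the A3 is not in the same row as the A3 in the first column.
If the goal is to remove a value only when it duplicates the first column in the same row, replace the if statement with the following (the is.na() check is needed because comparing NA with == yields NA, which if() rejects):
if(!is.na(dataset[x, 2]) && dataset[x, 2] == dataset[x, 1])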
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
V1 = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"),
V2 = c(NA, NA, "A3", NA, NA, NA, NA, NA),
V3 = c("C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8")
)
df %>%
mutate(V2 = if_else(V1 == V2, NA_character_, V2))
#> V1 V2 V3
#> 1 A1 <NA> C1
#> 2 A2 <NA> C2
#> 3 A3 <NA> C3
#> 4 A4 <NA> C4
#> 5 A5 <NA> C5
#> 6 A6 <NA> C6
#> 7 A7 <NA> C7
#> 8 A8 <NA> C8
Using replace.
transform(df, V2=replace(V2, V2 %in% V1, NA))
# V1 V2 V3
# 1 A1 <NA> C1
# 2 A2 <NA> C2
# 3 A3 <NA> C3
# 4 A4 <NA> C4
# 5 A5 <NA> C5
# 6 A6 <NA> C6
# 7 A7 <NA> C7
# 8 A8 <NA> C8
Or %in% in Reduce.
df[Reduce(`%in%`, df[1:2]), 'V2'] <- NA
df
# V1 V2 V3
# 1 A1 <NA> C1
# 2 A2 <NA> C2
# 3 A3 <NA> C3
# 4 A4 <NA> C4
# 5 A5 <NA> C5
# 6 A6 <NA> C6
# 7 A7 <NA> C7
# 8 A8 <NA> C8
Data:
df <- structure(list(V1 = c("A1", "A2", "A3", "A4", "A5", "A6", "A7",
"A8"), V2 = c(NA, NA, "A3", NA, NA, NA, NA, NA), V3 = c("C1",
"C2", "C3", "C4", "C5", "C6", "C7", "C8")), row.names = c(NA,
-8L), class = "data.frame")
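The answers above use two different tests: %in% matches a V2 value found anywhere in V1, while == only matches within the same row. On the question's data both give the same result, but they can differ. A small sketch on made-up data (df2, anywhere and same_row are illustrative names, not from the question):

```r
# Made-up data where "A3" appears in V2 on a row other than the one holding it in V1
df2 <- data.frame(V1 = c("A1", "A2", "A3"),
                  V2 = c("A3", NA, "A3"),
                  stringsAsFactors = FALSE)

# %in%: blanks every V2 value that occurs anywhere in V1
anywhere <- replace(df2$V2, df2$V2 %in% df2$V1, NA)

# ==: blanks only same-row duplicates; which() drops the NA comparisons
same_row <- replace(df2$V2, which(df2$V2 == df2$V1), NA)

anywhere
same_row
```

Here anywhere is all NA, while same_row keeps the "A3" in the first row because V1 there is "A1".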

How to use filter across and str_detect together to filter conditional on multiple columns

I have this dataframe:
df <- structure(list(col1 = c("Z2", "A2", "B2", "C2", "A2", "E2", "F2",
"G2"), col2 = c("Z2", "Z2", "A2", "B2", "C2", "D2", "A2", "F2"
), col3 = c("A2", "B2", "C2", "D2", "E2", "F2", "G2", "Z2")), class = "data.frame", row.names = c(NA, -8L))
> df
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 C2 B2 D2
5 A2 C2 E2
6 E2 D2 F2
7 F2 A2 G2
8 G2 F2 Z2
I would like to explicitly use filter, across and str_detect in a tidyverse setting to keep all rows in which any of col1:col3 starts with an A.
Expected result:
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 A2 C2 E2
5 F2 A2 G2
I have tried:
library(dplyr)
library(stringr)
df %>%
filter(across(c(col1, col2, col3), ~str_detect(., "^A")))
This gives:
[1] col1 col2 col3
<0 rows> (or 0-length row.names)
I want to learn why this code is not working using filter, across and str_detect!
We can use if_any. Inside filter, across combines the per-column results with &, i.e. every selected column has to meet the condition for a row to be kept, which is why no row is returned here.
library(dplyr)
library(stringr)
df %>%
filter(if_any(everything(), ~str_detect(., "^A")))
-output
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 A2 C2 E2
5 F2 A2 G2
According to ?across
if_any() and if_all() apply the same predicate function to a selection of columns and combine the results into a single logical vector: if_any() is TRUE when the predicate is TRUE for any of the selected columns, if_all() is TRUE when the predicate is TRUE for all selected columns.
across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().
The if_any/if_all are not part of the scoped variants
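To make the any-versus-all distinction concrete without relying on dplyr, here is a base R analogue (a sketch only, not how if_any or if_all are implemented). The names hits, any_row and all_row are illustrative.

```r
df <- data.frame(col1 = c("Z2", "A2", "B2", "C2", "A2", "E2", "F2", "G2"),
                 col2 = c("Z2", "Z2", "A2", "B2", "C2", "D2", "A2", "F2"),
                 col3 = c("A2", "B2", "C2", "D2", "E2", "F2", "G2", "Z2"),
                 stringsAsFactors = FALSE)

# One logical column per data column: does the value start with "A"?
hits <- sapply(df, startsWith, "A")

# "any" semantics (like if_any): at least one column matches
any_row <- rowSums(hits) > 0
# "all" semantics (like if_all, or across inside filter): every column matches
all_row <- rowSums(hits) == ncol(hits)

df[any_row, ]
df[all_row, ]
```

With this data, any_row selects rows 1, 2, 3, 5 and 7, while all_row selects nothing, which mirrors the empty result from the across attempt.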

R: Merge Data While Retaining Values for One Dataset in Duplicates

I have two data sets, data1 and data2:
data1 <- data.frame(ID = 1:6,
A = c("a1", "a2", NA, "a4", "a5", NA),
B = c("b1", "b2", "b3", NA, "b5", NA),
stringsAsFactors = FALSE)
data1
ID A B
1 a1 b1
2 a2 b2
3 NA b3
4 a4 NA
5 a5 b5
6 NA NA
and
data2 <- data.frame(ID = 1:6,
A = c(NA, "a2", "a3", NA, "a5", "a6"),
B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
stringsAsFactors = FALSE)
data2
ID A B
1 NA NA
2 a2 b2.wrong
3 a3 NA
4 NA b4
5 a5 b5
6 a6 b6
I would like to merge them by ID so that the resultant merged dataset, data.merged, populates fields from both datasets, but chooses values from data1 whenever both datasets have a value.
I.e., I would like the final dataset, data.merged, to be:
ID A B
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4
5 a5 b5
6 a6 b6
I have looked around, finding similar but not exact answers.
You can join the data and use coalesce to select the first non-NA value.
library(dplyr)
data1 %>%
inner_join(data2, by = 'ID') %>%
mutate(A = coalesce(A.x, A.y),
B = coalesce(B.x, B.y)) %>%
select(names(data1))
# ID A B
#1 1 a1 b1
#2 2 a2 b2
#3 3 a3 b3
#4 4 a4 b4
#5 5 a5 b5
#6 6 a6 b6
Or in base R, checking for NA with is.na and falling back to the data2 value:
transform(merge(data1, data2, by = 'ID'),
A = ifelse(is.na(A.x), A.y, A.x),
B = ifelse(is.na(B.x), B.y, B.x))[names(data1)]
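When there are more than a couple of shared columns, writing one ifelse per column gets tedious. A possible base R sketch that loops over every non-key column of data1, assuming both data frames have the same column names and one row per ID (merged and col are illustrative names):

```r
data1 <- data.frame(ID = 1:6,
                    A = c("a1", "a2", NA, "a4", "a5", NA),
                    B = c("b1", "b2", "b3", NA, "b5", NA),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = 1:6,
                    A = c(NA, "a2", "a3", NA, "a5", "a6"),
                    B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
                    stringsAsFactors = FALSE)

merged <- merge(data1, data2, by = "ID", suffixes = c(".x", ".y"))

# For each shared column, keep the data1 value (.x) unless it is NA
for (col in setdiff(names(data1), "ID")) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  merged[[col]] <- ifelse(is.na(x), y, x)
}

data.merged <- merged[names(data1)]
data.merged
```

Note that row 2 keeps "b2" from data1 rather than "b2.wrong" from data2, since the data2 value is only used when data1 is NA.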

Order by letters and numbers

I have a DF$vector which looks like this:
A10 A50
C1 C4
B1
A7
C3
B1 B4
I look for a way to order it as follows:
A10 A50
A7
B1 B4
B1
C1 C4
C3
I tried to use gsub :
vector[order(gsub("([A-Z]+)([0-9]+)", "\\1", vector),
as.numeric(gsub("([A-Z]+)([0-9]+)", "\\2", vector)))]
But it didn't return what I want.
Thank you for any suggestions.
We can use order from base R
df1[order(sub("\\d+", "", df1[,1]), as.numeric(sub("\\D+", "", df1[,1])), df1[,2] == ""),]
# A10 A50
#3 A7
#5 B1 B4
#2 B1
#1 C1 C4
#4 C3
data
df1 <-structure(list(A10 = c("C1", "B1", "A7", "C3", "B1"), A50 = c("C4",
"", "", "", "B4")), .Names = c("A10", "A50"), class = "data.frame",
row.names = c(NA, -5L))
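In programming languages, letters are ordered by magnitude, so A is considered less than B, and so on. Strings are compared character by character, which is enough here because every number has a single digit (with longer numbers, "A10" would sort before "A7"). Thus, to order the above, just use the code: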
df1$r=rank(df1$A10,ties.method = "last")
df1[order(df1$r),-ncol(df1)]
A10 A50
3 A7
5 B1 B4
2 B1
1 C1 C4
4 C3
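The difference between plain string ordering and ordering by letter and then by number shows up once a number has more than one digit. A short sketch on made-up values (x, letters_part and number_part are illustrative names), assuming each value is letters followed by digits:

```r
x <- c("C1", "B1", "A7", "C3", "A10")

# Split each value into its letter prefix and its numeric suffix
letters_part <- sub("\\d+$", "", x)
number_part  <- as.numeric(sub("^\\D+", "", x))

# Order by letters first, then by the number as a number
natural <- x[order(letters_part, number_part)]
natural

# Plain string sort compares character by character instead
sort(x)
```

Ordering by the two parts puts "A7" before "A10", whereas the plain string sort compares "1" with "7" and puts "A10" first.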
