Merging columns of a dataframe in R

I have the following data frame:
c1 <- c(1,2,"<NA>","<NA>")
c2 <- c("<NA>","<NA>",3,4)
df <- data.frame(c1,c2)
> df
    c1   c2
1    1 <NA>
2    2 <NA>
3 <NA>    3
4 <NA>    4
The following is the desired output I'm trying to obtain after merging columns 1 and 2:
> df
  c1
1  1
2  2
3  3
4  4
I tried:
df <- mutate(df, x = paste(c1, c2))
which gives:
> df
    c1   c2      x
1    1 <NA> 1 <NA>
2    2 <NA> 2 <NA>
3 <NA>    3 <NA> 3
4 <NA>    4 <NA> 4
Could someone give suggestions on how to obtain the desired output?

One way is this:
c1 <- c(1, 2, NA, NA)
c2 <- c(NA, NA, 3, 4)
df <- data.frame(c1, c2)
df2 <- data.frame(
c1 = ifelse(is.na(df$c1), df$c2, df$c1)
)
#df2
# c1
#1 1
#2 2
#3 3
#4 4

You are close, but you are pasting together two strings, one of which uses the literal string "<NA>" (NA in angle brackets) to represent nothing. When you paste strings together and want one of them not to appear in the result, it needs to be a zero-length string. You can do this using the recode command in dplyr.
You can modify your code to be:
library(dplyr)
df <- mutate(df, x = paste0(recode(c1, "<NA>" = ""), recode(c2, "<NA>" = "")))
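As a small follow-up (my own sketch, still assuming df holds the "<NA>" strings from the question): paste0() returns character, so you may want to convert the merged column to numeric at the end.
library(dplyr)
df %>%
  mutate(x = paste0(recode(c1, "<NA>" = ""), recode(c2, "<NA>" = "")),
         x = as.numeric(x)) %>%   # the pasted result is character; convert if numbers are wanted
  select(x)
#   x
# 1 1
# 2 2
# 3 3
# 4 4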

Another way using dplyr from tidyverse:
df2 <- df %>%
  mutate(c3 = if_else(is.na(c1), c2, c1)) %>%
  select(-c1, -c2) %>% # Given you only wanted one column
  rename(c1 = c3)      # Given you wanted the column to be called c1
Output:
c1
1 1
2 2
3 3
4 4

You could use rowSums():
data.frame(c1 = rowSums(df,na.rm = TRUE))
# c1
# 1 1
# 2 2
# 3 3
# 4 4
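Note that this assumes both columns are numeric with real NAs (as in the df rebuilt in the first answer). If you start from the "<NA>" strings in the question, you would convert first; a rough sketch of my own:
# convert the "<NA>" strings to real numeric NAs before summing (illustrative only)
df[] <- lapply(df, function(col) as.numeric(ifelse(col == "<NA>", NA, as.character(col))))
data.frame(c1 = rowSums(df, na.rm = TRUE))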

Since it seems the data frame actually contains NA values rather than '<NA>' strings, I would suggest using coalesce:
c1 <- c(1, 2, NA, NA)
c2 <- c(NA, NA, 3, 4)
df <- data.frame(c1, c2)
library(tidyverse)
df %>%
  mutate(c3 = coalesce(c1, c2))
Output:
c1 c2 c3
1 1 NA 1
2 2 NA 2
3 NA 3 3
4 NA 4 4
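As a side note (my own addition), coalesce() accepts any number of vectors, so with a recent dplyr you can splice a whole set of columns into one call:
df %>%
  mutate(c3 = coalesce(!!!select(., c1, c2)))   # splice every selected column into coalesce()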

Related

Transforming an R Dataframe with 2 columns and delimiter in rows

I have a dataframe with two columns, "id" and "detail" (df_current below). I need to group the dataframe by id and spread it so that the columns become "Interface1", "Interface2", etc., and each interface column contains the values that immediately follow each appearance of that interface name. Essentially the "!" works as a separator, but it is not needed in the output.
The desired output is shown below as: "df_needed_from_current".
I have tried multiple approaches (group_by, spread, reshape, dcast etc.), but can't get it to work. Any help would be greatly appreciated!
Sample Current Dataframe (code to create under):
id  detail
1   !
1   Interface1
1   a
1   b
1   !
1   Interface2
1   a
1   b
2   !
2   Interface1
2   a
2   b
2   c
2   !
2   Interface2
2   a
3   !
3   Interface1
3   a
3   b
3   c
3   d
df_current <- data.frame(
  id = c("1","1","1","1","1","1","1","1","2",
         "2","2","2","2","2","2","2","3","3",
         "3","3","3","3","4","4","4","4","4",
         "4","4","4","4","4","4","4","4","4",
         "5","5","5","5","5","5","5","5","5",
         "5","5","5","5"),
  detail = c("!", "Interface1","a","b","!",
             "Interface2","a","b","!","Interface1",
             "a","b","c","!","Interface2","a",
             "!", "Interface1","a","b","c","d",
             "!", "Interface1","a","b","!",
             "Interface2","a","b","c","!","Interface3",
             "a","b","c","!","Interface1","a","b","!",
             "Interface2","a","b","c","!","Interface3",
             "a","b"))
Dataframe Needed (code to create under):
ID  Interface1  Interface2  Interface3
1   a           a           NA
1   b           b           NA
2   a           a           NA
2   b           NA          NA
2   c           NA          NA
3   a           NA          NA
3   b           NA          NA
3   c           NA          NA
3   d           NA          NA
df_needed_from_current <- data.frame(
  id = c("1","1","2","2","2","3","3","3","3","4","4","4","5","5","5"),
  Interface1 = c("a","b","a","b","c","a","b","c","d","a","b","NA","a","b","NA"),
  Interface2 = c("a","b","a","NA","NA","NA","NA","NA","NA","a","b","c","a","b","c"),
  Interface3 = c("NA","NA","NA","NA","NA","NA","NA","NA","NA","a","b","c","a","b","NA")
)
We remove the rows where the 'detail' value is "!", then create a new column 'interface' that keeps only the 'detail' values with the prefix 'Interface'. We use fill from tidyr to fill the NA elements with the previous non-NA value, filter out the rows where 'detail' equals 'interface', create a row sequence id per group with rowid (from data.table), and reshape to 'wide' format with pivot_wider.
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df_current %>%
  filter(detail != "!") %>%
  mutate(interface = case_when(str_detect(detail, 'Interface') ~ detail)) %>%
  group_by(id) %>%
  fill(interface) %>%
  ungroup %>%
  filter(detail != interface) %>%
  mutate(rn = rowid(id, interface)) %>%
  pivot_wider(names_from = interface, values_from = detail) %>%
  select(-rn)
# A tibble: 15 x 4
# id Interface1 Interface2 Interface3
# <chr> <chr> <chr> <chr>
# 1 1 a a <NA>
# 2 1 b b <NA>
# 3 2 a a <NA>
# 4 2 b <NA> <NA>
# 5 2 c <NA> <NA>
# 6 3 a <NA> <NA>
# 7 3 b <NA> <NA>
# 8 3 c <NA> <NA>
# 9 3 d <NA> <NA>
#10 4 a a a
#11 4 b b b
#12 4 <NA> c c
#13 5 a a a
#14 5 b b b
#15 5 <NA> c <NA>
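To see why the rowid() step is needed (a side note of my own, not part of the answer): without a within-group row number the id/interface pairs are not unique, so pivot_wider() would collapse the duplicates into list-columns. The duplicates are easy to spot:
df_current %>%
  filter(detail != "!") %>%
  mutate(interface = case_when(str_detect(detail, "Interface") ~ detail)) %>%
  group_by(id) %>%
  fill(interface) %>%
  ungroup() %>%
  filter(detail != interface) %>%
  count(id, interface)   # any n > 1 is a duplicated pair that rowid() disambiguates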

Extracting values from a long string and create new columns based on the number of brackets in r

substr() can be a great way to extract values with conditions (in our case, extracting the values inside the brackets), but is there a handy way to extract several of them at once and create multiple new columns (with the number of new columns equal to the number of extracted values)?
Here is one example data:
index abc
1 1 qwer(urt123) qweqwe
2 2 rte(ret390) qweqwe(tertr213) ityorty(ret435)
3 3 <NA>
4 4 ogi(wqe685) qwe(ieow123)
5 5 cvb(bnm567)
code for creating the question data frame:
df <- data.frame(index = c(1:5),
                 abc = c("qwer(urt123) qweqwe", "rte(ret390) qweqwe(tertr213) ityorty(ret435)",
                         NA, "ogi(wqe685) qwe(ieow123)", "cvb(bnm567)"))
Final results:
index abc abc1 abc2 abc3
1 1 qwer(urt123) qweqwe urt123 <NA> <NA>
2 2 rte(ret390) qweqwe(tertr213) ityorty(ret435) ret390 tertr213 ret435
3 3 <NA> <NA> <NA> <NA>
4 4 ogi(wqe685) qwe(ieow123) wqe685 ieow123 <NA>
5 5 cvb(bnm567) bnm567 <NA> <NA>
The original data set has more than 10,000 lines and the number of brackets in the abc column could be more or less than 3.
Here is a base R solution
dfout <- cbind(df,
               gsub("\\(|\\)",
                    "",
                    do.call(rbind,
                            lapply(z <- with(df, regmatches(abc, gregexpr("\\(\\w+\\)", abc))),
                                   `length<-`,
                                   max(lengths(z))))))
such that
> dfout
index abc 1 2 3
1 1 qwer(urt123) qweqwe urt123 <NA> <NA>
2 2 rte(ret390) qweqwe(tertr213) ityorty(ret435) ret390 tertr213 ret435
3 3 <NA> <NA> <NA> <NA>
4 4 ogi(wqe685) qwe(ieow123) wqe685 ieow123 <NA>
5 5 cvb(bnm567) bnm567 <NA> <NA>
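For readability, here is the same idea split into steps (a sketch of my own, same logic as the one-liner above):
m   <- regmatches(df$abc, gregexpr("\\(\\w+\\)", df$abc))  # bracketed tokens, one character vector per row
len <- max(lengths(m))                                     # the widest row decides how many columns are needed
mat <- do.call(rbind, lapply(m, `length<-`, len))          # pad shorter rows with NA and bind into a matrix
dfout <- cbind(df, gsub("\\(|\\)", "", mat))               # strip the brackets and bind as new columns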
Here is my attempt. I used a regular expression to extract the letters and digits that sit inside the parentheses. stri_extract_all_regex() returns a list, so I used unnest_wider() to create new columns. The final step was to revise the three column names: after using unnest_wider() we get ...1 as a column name, for example, so I revised any column name containing ... by replacing ... with foo.
library(tidyverse)
library(stringi)
mutate(mydf,
       foo = stri_extract_all_regex(str = abc, pattern = "(?<=\\()[[:alnum:]]+(?=\\))")) %>%
  unnest_wider(foo) %>%
  rename_at(vars(contains("...")),
            .funs = list(~ sub(x = ., pattern = "\\.+", replacement = "foo")))
index abc foo1 foo2 foo3
<int> <chr> <chr> <chr> <chr>
1 1 qwer(urt123) qweqwe urt123 NA NA
2 2 rte(ret390) qweqwe(tertr213) ityorty(ret435) ret390 tertr213 ret435
3 3 NA NA NA NA
4 4 ogi(wqe685) qwe(ieow123) wqe685 ieow123 NA
5 5 cvb(bnm567) bnm567 NA NA
DATA
mydf <- structure(list(index = 1:5,
                       abc = c("qwer(urt123) qweqwe",
                               "rte(ret390) qweqwe(tertr213) ityorty(ret435)",
                               NA, "ogi(wqe685) qwe(ieow123)", "cvb(bnm567)")),
                  row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
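Roughly the same extraction can be done with stringr instead of stringi (a sketch of my own; same regex, and unnest_wider()'s names_sep argument in recent tidyr avoids the rename step):
library(tidyverse)
mutate(mydf, foo = str_extract_all(abc, "(?<=\\()[[:alnum:]]+(?=\\))")) %>%
  unnest_wider(foo, names_sep = "")   # produces foo1, foo2, foo3 directly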

Unequal rows in list from unstack() - how to create a dataframe

I am (trying) to do a Robust ANOVA analysis in R. This requires that my two variables are in a very specific format. Basically, the requirement is to unstack two columns in my current dataframe and form an outcome frequency dataframe based on the predictor (categorical variable). This would usually happen automatically using the unstack() function i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the list returned has unequal rows for each category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the category as the 'header' and the outcome as a reference column, with the frequencies as the other columns, such that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this -
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?
This gets you most of the way to your answer:
library(dplyr)
library(tidyr)

test <- list(A = c(2, 4, 2, 3, 3),
             B = c(3, 3),
             C = c(5),
             D = c(4, 4, 3))
test <- lapply(1:length(test), function(i){
  x <- data.frame(names(test)[i], test[i],
                  stringsAsFactors = FALSE)
  names(x) <- c("ID", "Value")
  x})
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key = ID, value = Freq)
replace(test, test == 0, NA)
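If you want exactly the 1-6 outcome rows from the question, one further option (a sketch of my own, not part of the answer above) is to tabulate with fixed factor levels:
myUnstackedList <- list(A = c(2, 4, 2, 3, 3), B = c(3, 3), C = 5, D = c(4, 4, 3))
long <- data.frame(group = rep(names(myUnstackedList), lengths(myUnstackedList)),
                   score = unlist(myUnstackedList))
res <- as.data.frame.matrix(table(factor(long$score, levels = 1:6), long$group))
res[res == 0] <- NA   # show absent outcome/category combinations as NA
res
#    A  B  C  D
# 1 NA NA NA NA
# 2  2 NA NA NA
# 3  2  2 NA  1
# 4  1 NA NA  2
# 5 NA NA  1 NA
# 6 NA NA NA NA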
I'm not sure what the issue was with your previous dplyr attempt; however, I offer:
library(tidyr)
library(dplyr)
df <- tibble(
  outcome = c(1:5, 1:2, 1, 1:3),
  group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
  score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
  group_by(outcome) %>%
  spread(group, score) %>%
  ungroup() %>%
  select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA

How to get the top element per group with multiple columns?

I have the use case shown below. Basically, I have a data frame with three columns. I want to group by two columns (c1, c2) and sum the third one, c3. Then, for each c1, I want to pick only the row with the maximum c3 (among all c2), i.e. sorting would be unnecessary since I'm only interested in the max.
library(plyr)
df <- data.frame(c1=c('a','a','a','b','b','c'),c2=c('x','y','y','x','y','x'),c3=c(1,2,3,4,5,6))
df
c1 c2 c3
1 a x 1
2 a y 2
3 a y 3
4 b x 4
5 b y 5
6 c x 6
sel <- plyr::ddply(df, c('c1','c2'), plyr::summarize,c3=sum(c3))
sel[with(sel, order(c1,-c3)),]
c1 c2 c3
2 a y 5 <<< this one highest c3 for (c1,c2) combination
1 a x 1
4 b y 5 <<< this one highest c3 for (c1,c2) combination
3 b x 4
5 c x 6 <<< this one highest c3 for (c1,c2) combination
I could do this in a loop but I'm wondering how it can be done in a vector fashion or using a high-level function.
Here's a base R approach:
df2 <- aggregate(c3~c1+c2, df, sum)
subset(df2[order(-df2$c3),], !duplicated(c1))
# c1 c2 c3
#3 c x 6
#4 a y 5
#5 b y 5
Another solution from dplyr.
library(dplyr)
df2 <- df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  filter(c3 == max(c3))
df2
# A tibble: 3 x 3
# Groups: c1 [3]
c1 c2 c3
<fctr> <fctr> <dbl>
1 a y 5
2 b y 5
3 c x 6
Here is another option with data.table
library(data.table)
setDT(df)[, .(c3 = sum(c3)) , .(c1, c2)][, .SD[which.max(c3)], .(c1)]
# c1 c2 c3
#1: a y 5
#2: b y 5
#3: c x 6
Using dplyr:
df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  top_n(1, c3)
Or the last line can be slice(which.max(c3)), which will guarantee a single row.
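For completeness, the slice() variant spelled out (same grouping as above; after summarise() the result is grouped by c1, so the max is taken within each c1):
df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  slice(which.max(c3))   # exactly one row per c1, even with ties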

merge two dataframes by nearest preceding date while aggregating

I am trying to match two datasets by nearest preceding date, by group.
So within a group, I would like to add the variables of a second dataset (d2) to those of the first (d1) when the date in the first is the nearest date on or before the date in the second. If two rows in the second dataset are matched with one row in the first, I would like to add the larger of the values. (There will always be at least one date in d1 less than the date in d2, by group.)
Here is an example, which hopefully makes it clearer
d1 = data.frame(id = c(1,1,1,2,2),
                ref = as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07")))
d1
# id ref
# 1 1 2013-12-07
# 2 1 2014-12-07
# 3 1 2015-12-07
# 4 2 2013-11-07
# 5 2 2014-11-07
d2 = data.frame(id = c(1,1,2),
                date = as.Date(c("2014-05-07", "2014-12-05", "2015-11-05")),
                x1 = factor(c(1,2,2), ordered = TRUE),
                x2 = factor(c(2, NA, 2), ordered = TRUE))
d2
# id date x1 x2
# 1 1 2014-05-07 1 2
# 2 1 2014-12-05 2 <NA>
# 3 2 2015-11-05 2 2
With the expected outcome
output = data.frame(id = c(1,1,1,2,2),
                    ref = as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07")),
                    x1 = c(2, NA, NA, NA, 2),
                    x2 = c(2, NA, NA, NA, 2))
output
# id ref x1 x2
# 1 1 2013-12-07 2 2
# 2 1 2014-12-07 NA NA
# 3 1 2015-12-07 NA NA
# 4 2 2013-11-07 NA NA
# 5 2 2014-11-07 2 2
So, for example, the first two observations of d2 (id = 1, dates "2014-05-07" and "2014-12-05") are matched to the earlier date "2013-12-07" in d1. Because two rows are matched to one row in d1, the highest level is selected.
I could do this in base R by looping the following calculations through
each group but I was hoping for something more efficient.
I would love to see a data.table approach (but I am limited to R v3.1 and data.table v1.9.4). Thanks
real dataset:
d1: rows 1M / 100K groups
d2: rows 11K / 4K groups
# for one group
x = d1[d1$id==1, ]
y = d2[d2$id==1, ]
id = apply(outer(x$ref, y$date, "-"), 2, which.min)
temp = cbind(y, ref=x$ref[id])
# aggregate variables by ref
temp = merge(aggregate(x1 ~ ref, data = temp, max),
             aggregate(x2 ~ ref, data = temp, max))
merge(x, temp, all=T)
ps: I had looked at How to match by nearest date from two data frames? and Join data.table on exact date or if not the case on the nearest less than date with no success.
You can do this using dplyr:
d2$ind <- 0
library(dplyr)
out <- d1 %>% full_join(d2, by = c("id", "ref" = "date")) %>%
  arrange(id, ref) %>%
  mutate(ind = cumsum(ifelse(is.na(ind), 1, ind))) %>%
  group_by(ind) %>%
  summarise(ref = min(ref), x1 = max(x1, na.rm = TRUE), x2 = max(x2, na.rm = TRUE))
### A tibble: 5 x 4
## ind ref x1 x2
## <dbl> <date> <fctr> <fctr>
##1 1 2013-12-07 2 2
##2 2 2014-12-07 NA NA
##3 3 2015-12-07 NA NA
##4 4 2013-11-07 NA NA
##5 5 2014-11-07 2 2
We first add a column of indicators to d2 and set those to zero. Then, we perform a full outer join between d1 and d2. Those rows in d1 will have ind of NA. We sort by id and ref (i.e., the date), and we replace the NA entries of ind with 1 and perform a cumsum. This results in:
id ref x1 x2 ind
1 1 2013-12-07 <NA> <NA> 1
2 1 2014-05-07 1 2 1
3 1 2014-12-05 2 <NA> 1
4 1 2014-12-07 <NA> <NA> 2
5 1 2015-12-07 <NA> <NA> 3
6 2 2013-11-07 <NA> <NA> 4
7 2 2014-11-07 <NA> <NA> 5
8 2 2015-11-05 2 2 5
From this we can easily see that we can group by ind and summarise appropriately to get your result.
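Since a data.table approach was also asked about, here is a rough sketch of my own (written against a recent data.table rather than v1.9.4, so it may need adjustments; it uses the d1/d2 from the question): a rolling join attaches each d2 date to the nearest preceding d1 ref, then the larger level is kept when two d2 rows hit the same d1 row.
library(data.table)
setDT(d1); setDT(d2)
d1[, ref_copy := ref]                  # keep d1's own date; the join key column takes d2's values
setkey(d1, id, ref)
setkey(d2, id, date)
m <- d1[d2, roll = Inf]                # for each d2 row, the last d1 ref on or before its date
agg <- m[, .(x1 = max(x1, na.rm = TRUE), x2 = max(x2, na.rm = TRUE)),
         by = .(id, ref = ref_copy)]   # take the larger ordered-factor level per matched d1 row
merge(d1[, .(id, ref)], agg, by = c("id", "ref"), all.x = TRUE)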
