Insert dots in column names in wide data using R

The following data set is in wide format and has repeated measures "ql", "st" and "xy", each prefixed by "a", "b" or "c":
df<-data.frame(id=c(1,2,3,4),
ex=c(1,0,0,1),
aql=c(5,4,NA,6),
bql=c(5,7,NA,9),
cql=c(5,7,NA,9),
bst=c(3,7,8,9),
cst=c(8,7,5,3),
axy=c(1,9,4,4),
cxy=c(5,3,1,4))
I'm looking for a way to insert a dot after the prefix letter ("a", "b" or "c") while leaving the other columns (id, ex) unchanged. I've been trying to do this with gsub, e.g.
names(df) <- gsub("", "\\.", names(df))
but this gives undesired results. The expected output is:
id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4

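Try anchoring the pattern at the start of the name (assign the result back with names(df) <- ... to apply it):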
sub("(^[a-c])(.+)", "\\1.\\2", names(df))
# [1] "id" "ex" "a.ql" "b.ql" "c.ql" "b.st" "c.st" "a.xy" "c.xy"
or
sub("(?<=^[a-c])", ".", names(df), perl = TRUE)
# [1] "id" "ex" "a.ql" "b.ql" "c.ql" "b.st" "c.st" "a.xy" "c.xy"
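To see the anchored replacement applied in place, here is a minimal base R sketch. It assumes the measure suffixes are exactly "ql", "st" and "xy" (taken from the question). Spelling them out in the pattern is an extra safeguard that is not part of the answer above: it leaves an unrelated column, such as the hypothetical "age" added here, untouched.

```r
# A few columns from the question, plus a hypothetical "age" column
df <- data.frame(id = 1:4, ex = c(1, 0, 0, 1), age = c(30, 41, 25, 52),
                 aql = c(5, 4, NA, 6), bst = c(3, 7, 8, 9), cxy = c(5, 3, 1, 4))

# The empty pattern from the question matches between characters, so gsub()
# inserts dots throughout each name instead of only after the prefix.
broken <- gsub("", ".", names(df))

# Anchor on prefix + known suffix; names that do not match are returned unchanged.
names(df) <- sub("^([a-c])(ql|st|xy)$", "\\1.\\2", names(df))
print(names(df))
```

Listing the suffixes trades generality for safety; the shorter `^[a-c]` pattern is fine when no other column starts with a, b or c.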

You can do
setNames(df, sub("(ql$)|(st$)|(xy$)", "\\.\\1\\2\\3", names(df)))
#> id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
#> 1 1 1 5 5 5 3 8 1 5
#> 2 2 0 4 7 7 7 7 9 3
#> 3 3 0 NA NA NA 8 5 4 1
#> 4 4 1 6 9 9 9 3 4 4

Another way you can try, using dplyr and stringr:
library(dplyr)
library(stringr)  # provides str_replace()
df %>%
  rename_at(vars(aql:cxy), ~ str_replace(., "(?<=\\w{1})", "\\."))
# id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
# 1 1 1 5 5 5 3 8 1 5
# 2 2 0 4 7 7 7 7 9 3
# 3 3 0 NA NA NA 8 5 4 1
# 4 4 1 6 9 9 9 3 4 4

You can also try a tidyverse approach reshaping your data like this:
library(tidyverse)
#Data
df<-data.frame(id=c(1,2,3,4),
ex=c(1,0,0,1),
aql=c(5,4,NA,6),
bql=c(5,7,NA,9),
cql=c(5,7,NA,9),
bst=c(3,7,8,9),
cst=c(8,7,5,3),
axy=c(1,9,4,4),
cxy=c(5,3,1,4))
#Reshape
df %>% pivot_longer(-c(1,2)) %>%
mutate(name=paste0(substring(name,1,1),'.',substring(name,2,nchar(name)))) %>%
pivot_wider(names_from = name,values_from=value)
Output:
# A tibble: 4 x 9
id ex a.ql b.ql c.ql b.st c.st a.xy c.xy
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4

Related

Replace NA values when they are in two adjacent columns

Hi, this is an example of a data frame similar to the one I am working with. I have an experiment with 10 samples and two replicates:
df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"Rep1" = c(6,5,3,"Na","Na",9,4,"Na","Na",2),
"Rep2" = c(8,4,4,"Na",3,"Na",6,"Na",2,1))
Several values are "Na", but I only want to replace them with zeros in samples 4 and 8, because those are the only samples that have "Na" in both replicates. The other samples should keep their "Na".
You can also use the following solution. We iterate over each row and find the index or indices of the values equal to "Na". If there is more than one such index, we replace those values with "0"; otherwise the row is left as it is:
library(dplyr)
library(purrr)
df %>%
pmap_df(., ~ {ind <- which(c(...) == "Na");
if(length(ind) > 1) {
replace(c(...), ind, "0")
} else {
c(...)
}
}
) %>%
mutate(across(ID, as.integer))
# A tibble: 10 x 3
ID Rep1 Rep2
<int> <chr> <chr>
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
P.S. I almost went crazy wondering why I could not get it to work, only to realize that your NAs are in fact the string "Na".
We create an index of the rows where both 'Rep' columns are "Na", using rowSums on a logical matrix. We then use that row index and the column names to subset the data and assign 0 to those cells:
nm1 <- grep("Rep", names(df), value = TRUE)
i1 <- rowSums(df[nm1] == "Na") == length(nm1)
df[i1, nm1] <- 0
-output
df
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
Because the OP created the string "Na", the 'Rep' columns are character, not numeric. We can convert them to numeric with
df[-1] <- lapply(df[-1], as.numeric)
which forces the remaining "Na" strings to NA (with a coercion warning).
-output
df
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 NA 3
6 6 9 NA
7 7 4 6
8 8 0 0
9 9 NA 2
10 10 2 1
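If the missing values are stored as real NA rather than the string "Na", the same rowSums idea works on numeric columns directly, with no conversion step. A minimal base R sketch, assuming the data is rebuilt with NA (column names follow the question):

```r
df <- data.frame(ID = 1:10,
                 Rep1 = c(6, 5, 3, NA, NA, 9, 4, NA, NA, 2),
                 Rep2 = c(8, 4, 4, NA, 3, NA, 6, NA, 2, 1))

rep_cols <- grep("^Rep", names(df), value = TRUE)

# TRUE for rows where every replicate is missing
all_missing <- rowSums(is.na(df[rep_cols])) == length(rep_cols)

# Only those rows are set to 0; partially missing rows keep their NA
df[all_missing, rep_cols] <- 0
print(df)
```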
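With dplyr we could hard-code the two IDs:
library(dplyr)
df %>%
  # %in% keeps the ID test attached to the "Na" test; `a & b | c` parses as `(a & b) | c`
  mutate(across(starts_with("Rep"), ~case_when(. == "Na" & ID %in% c(4, 8) ~ "0",
                                               TRUE ~ .)))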
Output:
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
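When a condition mixes & and |, keep in mind that R evaluates & before |. The following base R sketch uses made-up vectors (not the question's data) to show how an unparenthesised condition can select a row that does not contain "Na" at all:

```r
# Hypothetical values: the row with id 8 does NOT hold an "Na" string
id <- c(4, 5, 8)
is_na_string <- c(TRUE, TRUE, FALSE)

# Parsed as (is_na_string & id == 4) | id == 8, so id 8 is selected regardless
loose <- is_na_string & id == 4 | id == 8

# The ID test stays tied to the "Na" test
strict <- is_na_string & id %in% c(4, 8)

print(loose)
print(strict)
```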
Although this has been marked as solved, here is another simple answer:
df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"Rep1" = c(6,5,3,"Na","Na",9,4,"Na","Na",2),
"Rep2" = c(8,4,4,"Na",3,"Na",6,"Na",2,1))
library(dplyr)
df %>% group_by(ID) %>%
mutate(replace(cur_data(), all(cur_data() == 'Na'), '0'))
#> # A tibble: 10 x 3
#> # Groups: ID [10]
#> ID Rep1 Rep2
#> <dbl> <chr> <chr>
#> 1 1 6 8
#> 2 2 5 4
#> 3 3 3 4
#> 4 4 0 0
#> 5 5 Na 3
#> 6 6 9 Na
#> 7 7 4 6
#> 8 8 0 0
#> 9 9 Na 2
#> 10 10 2 1
OR
df %>% rowwise() %>%
mutate(replace(cur_data()[-1], all(cur_data()[-1] == 'Na'), '0'))

Retrieve a value by another column criteria in R

I need some help.
I have this df:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2),
day = c(1,2,3,4,5,1,2,3,4,5),
flow = c(2,5,7,8,5,4,6,7,9,2))
month day flow
1 1 1 2
2 1 2 5
3 1 3 7
4 1 4 8
5 1 5 5
6 2 1 4
7 2 2 6
8 2 3 7
9 2 4 9
10 2 5 2
but I want to know the day of the minimum flow per month:
month day flow dayminflowofthemonth
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
The repetition is not a problem; I will use a pivot function afterwards.
Thanks, people!
We can use which.min to return the index of 'min'imum 'flow' per group and use that to get the corresponding 'day' to create the column with mutate
library(dplyr)
df <- df %>%
group_by(month) %>%
mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
ungroup()
-output
df
# A tibble: 10 x 4
# month day flow dayminflowofthemonth
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 1
# 2 1 2 5 1
# 3 1 3 7 1
# 4 1 4 8 1
# 5 1 5 5 1
# 6 2 1 4 5
# 7 2 2 6 5
# 8 2 3 7 5
# 9 2 4 9 5
#10 2 5 2 5
Another option using indexing inside dplyr pipeline:
library(dplyr)
#Code
newdf <- df %>% group_by(month) %>% mutate(Val=day[flow==min(flow)][1])
Output:
# A tibble: 10 x 4
# Groups: month [2]
month day flow Val
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
Here is a base R option using ave
transform(
df,
dayminflowofthemonth = ave(day*(ave(flow,month,FUN = min)==flow),month,FUN = max)
)
which gives
month day flow dayminflowofthemonth
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
One more base R approach (the positional index [df$month] assumes the months are numbered 1, 2, ... with no gaps):
df$dayminflowofthemonth <- by(
df,
df$month,
function(x) x$day[which.min(x$flow)]
)[df$month]
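The grouped which.min lookup can also be written with ave alone. The sketch below is one possible variant, not taken from the answers above: it passes row numbers to ave so that day and flow can be looked up together within each month, and it does not depend on how the months are numbered. On ties, which.min returns the first minimum.

```r
df <- data.frame(month = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 day   = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 flow  = c(2, 5, 7, 8, 5, 4, 6, 7, 9, 2))

# Inside FUN, i holds the row numbers belonging to one month
df$dayminflowofthemonth <- ave(seq_len(nrow(df)), df$month, FUN = function(i) {
  df$day[i][which.min(df$flow[i])]
})
print(df)
```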

Is there an R function which can pass elements of lists as arguments without specifying individual elements

Is there an R function which can pass all the elements of a list as the arguments of a function?
library(tidyr)
a <- c(1,2,3)
b <- c(4,5,6)
c <- c(7,8,9)
d <- list(a,b,c)
crossing(d[[1]],d[[2]],d[[3]])
Instead of specifying d[[1]],d[[2]],d[[3]], I'd like to just pass d.
Expected result:
> crossing(d[[1]],d[[2]],d[[3]])
# A tibble: 27 x 3
`d[[1]]` `d[[2]]` `d[[3]]`
<dbl> <dbl> <dbl>
1 1 4 7
2 1 4 8
3 1 4 9
4 1 5 7
5 1 5 8
6 1 5 9
7 1 6 7
8 1 6 8
9 1 6 9
10 2 4 7
# ... with 17 more rows
You can use do.call, which executes a function call from a function (or its name) and a list of arguments to be passed to it.
c(d[[1]],d[[2]],d[[3]])
#[1] 1 2 3 4 5 6 7 8 9
do.call("c", d)
#[1] 1 2 3 4 5 6 7 8 9
And for crossing, which needs unique (non-duplicated) column names, name the list elements first:
library(tidyr)
names(d) <- seq_along(d)
do.call(crossing, d)
## A tibble: 27 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
# 1 1 4 7
# 2 1 4 8
# 3 1 4 9
# 4 1 5 7
# 5 1 5 8
# 6 1 5 9
# 7 1 6 7
# 8 1 6 8
# 9 1 6 9
#10 2 4 7
## … with 17 more rows
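The same pattern works with base R functions. As a minimal sketch, expand.grid stands in here for crossing (an assumption made to keep the example free of packages; the row order of the result differs from crossing), and the list names a, b and c are chosen for illustration:

```r
d <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# Equivalent to expand.grid(a = d$a, b = d$b, c = d$c)
grid <- do.call(expand.grid, d)

# Extra named arguments can be appended to the argument list
labels <- do.call(paste, c(d, sep = "-"))

print(dim(grid))
print(labels)
```

Named list elements become named arguments, which is why naming the list controls the resulting column names.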

Count number of new and lost friends between two data frames in R

I have two data frames of the same respondents, one from Time 1 and the next from Time 2. In each wave they nominated their friends, and I want to know:
1) how many friends are nominated in Time 2 but not in Time 1 (new friends)
2) how many friends are nominated in Time 1 but not in Time 2 (lost friends)
Sample data:
Time 1 DF
ID friend_1 friend_2 friend_3
1 4 12 7
2 8 6 7
3 9 NA NA
4 15 7 2
5 2 20 7
6 19 13 9
7 12 20 8
8 3 17 10
9 1 15 19
10 2 16 11
Time 2 DF
ID friend_1 friend_2 friend_3
1 4 12 3
2 8 6 14
3 9 NA NA
4 15 7 2
5 1 17 9
6 9 19 NA
7 NA NA NA
8 7 1 16
9 NA 10 12
10 7 11 9
So the desired DF would include these columns (EDIT filled in columns):
ID num_newfriends num_lostfriends
1 1 1
2 1 1
3 0 0
4 0 0
5 3 3
6 0 1
7 0 3
8 3 3
9 2 3
10 2 2
EDIT2:
I've tried doing an anti join
df3 <- anti_join(df1, df2)
But this method doesn't take into account friend ID numbers that appear in a different column in Time 2 (for example, respondent #6 nominated friends 9 and 19 in both T1 and T2, but in different columns).
Another option:
library(tidyverse)
left_join(
gather(df1, key, x, -ID),
gather(df2, key, y, -ID),
by = c("ID", "key")
) %>%
group_by(ID) %>%
summarise(
num_newfriends = sum(!y[!is.na(y)] %in% x[!is.na(x)]),
num_lostfriends = sum(!x[!is.na(x)] %in% y[!is.na(y)])
)
Output:
# A tibble: 10 x 3
ID num_newfriends num_lostfriends
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
Simple comparisons would be an option (time1 and time2 are the Time 1 and Time 2 data frames):
library(tidyverse)
na_sums_old <- rowSums(is.na(time1))
na_sums_new <- rowSums(is.na(time2))
kept_friends <- map_dbl(seq(nrow(time1)), ~ sum(time1[.x, -1] %in% time2[.x, -1]))
kept_friends <- kept_friends - na_sums_old * (na_sums_new >= 1)
new_friends <- 3 - na_sums_new - kept_friends
lost_friends <- 3 - na_sums_old - kept_friends
tibble(ID = time1$ID, new_friends = new_friends, lost_friends = lost_friends)
# A tibble: 10 x 3
ID new_friends lost_friends
<int> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
You can make anti_join work by first pivoting to a "long" data frame.
library(tidyverse)
df1 <- df1 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
df2 <- df2 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
head(df1)
#> # A tibble: 6 x 3
#> ID name friend
#> <int> <chr> <int>
#> 1 1 friend_1 4
#> 2 1 friend_2 12
#> 3 1 friend_3 7
#> 4 2 friend_1 8
#> 5 2 friend_2 6
#> 6 2 friend_3 7
lost_friends <- anti_join(df1, df2, by = c("ID", "friend"))
new_friends <- anti_join(df2, df1, by = c("ID", "friend"))
respondents <- distinct(df1, ID)
respondents %>%
  full_join(
    count(lost_friends, ID, name = "num_lost_friends")
  ) %>%
  full_join(
    count(new_friends, ID, name = "num_new_friends")
  ) %>%
  mutate_at(vars(starts_with("num_")), replace_na, 0)
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 3
#> ID num_lost_friends num_new_friends
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 3 3
#> 6 6 1 0
#> 7 7 3 0
#> 8 8 3 3
#> 9 9 3 2
#> 10 10 2 2
Created on 2019-11-01 by the reprex package (v0.3.0)
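For comparison, the per-respondent counts can be computed in base R with setdiff, which ignores the column a friend appears in. This is a sketch under the assumption that both data frames list the same respondents in the same row order; count_only_in is a hypothetical helper name, and the data is typed in from the question.

```r
df1 <- data.frame(ID = 1:10,
                  friend_1 = c(4, 8, 9, 15, 2, 19, 12, 3, 1, 2),
                  friend_2 = c(12, 6, NA, 7, 20, 13, 20, 17, 15, 16),
                  friend_3 = c(7, 7, NA, 2, 7, 9, 8, 10, 19, 11))
df2 <- data.frame(ID = 1:10,
                  friend_1 = c(4, 8, 9, 15, 1, 9, NA, 7, NA, 7),
                  friend_2 = c(12, 6, NA, 7, 17, 19, NA, 1, 10, 11),
                  friend_3 = c(3, 14, NA, 2, 9, NA, NA, 16, 12, 9))

# Hypothetical helper: per row, count friends present in `from` but absent in `to`
count_only_in <- function(from, to) {
  vapply(seq_len(nrow(from)), function(i) {
    a <- unlist(from[i, -1])
    b <- unlist(to[i, -1])
    length(setdiff(a[!is.na(a)], b[!is.na(b)]))  # NAs are dropped before comparing
  }, integer(1))
}

result <- data.frame(ID = df1$ID,
                     num_newfriends = count_only_in(df2, df1),
                     num_lostfriends = count_only_in(df1, df2))
print(result)
```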

R how to fill in NA with rules

data=data.frame(person=c(1,1,1,2,2,2,2,3,3,3,3),
t=c(3,NA,9,4,7,NA,13,3,NA,NA,12),
WANT=c(3,6,9,4,7,10,13,3,6,9,12))
I want to create a new variable 'WANT' that, wherever t is NA, takes the previous value and adds 3 to it; if there are several NAs in a row, it keeps adding 3. My attempt is:
library(dplyr)
data %>%
group_by(person) %>%
mutate(WANT_TRY = fill(t) + 3)
Here's one way -
data %>%
group_by(person) %>%
mutate(
# cs = cumsum(!is.na(t)), # creates index for reference value; uncomment if interested
w = case_when(
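# rle() gives the run lengths of NA / non-NA; sequence() counts positions within each run
# t[!is.na(t)][...] looks up the last observed value (indexing t directly can land on an NA)
is.na(t) ~ t[!is.na(t)][cumsum(!is.na(t))] + 3*sequence(rle(is.na(t))$lengths),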
TRUE ~ t
)
) %>%
ungroup()
# A tibble: 11 x 4
person t WANT w
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
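Here is another way. We can do linear interpolation of t with the imputeTS package; for this data, linear interpolation happens to reproduce the "+3" steps. (In imputeTS 3.0 and later the function is named na_interpolation.)
library(dplyr)
library(imputeTS)
data2 <- data %>%
  group_by(person) %>%
  mutate(WANT2 = na.interpolation(t)) %>%  # interpolate t, not the target column WANT
  ungroup()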
data2
# # A tibble: 11 x 4
# person t WANT WANT2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3
# 2 1 NA 6 6
# 3 1 9 9 9
# 4 2 4 4 4
# 5 2 7 7 7
# 6 2 NA 10 10
# 7 2 13 13 13
# 8 3 3 3 3
# 9 3 NA 6 6
# 10 3 NA 9 9
# 11 3 12 12 12
This is harder than it seems because of the double NA at the end. If it weren't for that, then the following:
ifelse(is.na(data$t), c(0, data$t[-nrow(data)])+3, data$t)
...would give you what you want. The simplest way, which uses the same logic but doesn't look very clever (sorry!), would be:
.impute <- function(x) ifelse(is.na(x), c(0, x[-length(x)])+3, x)
.impute(.impute(data$t))
...which just cheats by doing it twice. Does that help?
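You can use functional programming from purrr and "NA-safe" addition from hablar. Note that the formula ignores the incoming element (.y), so it simply produces t[1], t[1] + 3, t[1] + 6, ... within each person; it matches WANT here only because every observed value already lies on that sequence: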
library(hablar)
library(dplyr)
library(purrr)
data %>%
group_by(person) %>%
mutate(WANT2 = accumulate(t, ~.x %plus_% 3))
Result
# A tibble: 11 x 4
# Groups: person [3]
person t WANT WANT2
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
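The rule as stated (an NA becomes the previous value plus 3, applied repeatedly for runs of NA) can also be written directly as a small loop applied per person with ave. This is a minimal base R sketch; fill_step and its step argument are hypothetical names, and a leading NA is left as NA because it has no previous value.

```r
data <- data.frame(person = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                   t = c(3, NA, 9, 4, 7, NA, 13, 3, NA, NA, 12),
                   WANT = c(3, 6, 9, 4, 7, 10, 13, 3, 6, 9, 12))

# Hypothetical helper: walk forward and fill each NA from the (possibly filled) previous value
fill_step <- function(x, step = 3) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1] + step
  }
  x
}

# ave() applies fill_step within each person, so values never carry across people
data$WANT_TRY <- ave(data$t, data$person, FUN = fill_step)
print(data)
```

Unlike interpolation, this uses only the preceding value, so it still follows the rule when the next observed value is not on the same arithmetic sequence.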
