How to build a variable that summarizes multiple variables - r

I have data that can be created with the following code:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
I want to build a variable "Event" that captures all events in each row, e.g. "P/N/C" when all three occur and an empty string when none does.
What should I do? I would like to know as many ways as possible. Thanks.

One option is to use apply(), as shown below. The suggestion from @AllanCameron is also a great choice. Here is the code:
#Vectors
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
#Data
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C,stringsAsFactors = F)
#Option 1
index <- which(grepl('Event',names(Sample.data)))
Sample.data$Event <- apply(Sample.data[,index],1,function(x) paste0(x[x!=''],collapse='/'))
Output:
   ID Days Event_P Event_N Event_C Event
1   1   -5                       C     C
2   1    1
3   1   18       P               C   P/C
4   1   30
5   2    1               N             N
6   2    8
7   2   16       P       N       C P/N/C
8   3    1
9   3    8
10  3    6       P       N       C P/N/C
11  4   -6               N             N
12  4    1
13  4    7       P       N           P/N
14  4   15       P       N           P/N

Duck's answer is very good, but you mentioned you want as many ways as possible, so here are two more:
You could combine the columns either with paste() inside tidyverse's mutate() or with base R's interaction(), and then use gsub() to clear out the leftover separators:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
library(tidyverse)
Sample.data %>%
mutate(Event = paste(Event_P, Event_N, Event_C, sep='/'),
Event = gsub('^/|^//|/$|//$', '', Event),
Event = gsub('//', '/', Event))
#>    ID Days Event_P Event_N Event_C Event
#> 1   1   -5                       C     C
#> 2   1    1
#> 3   1   18       P               C   P/C
#> 4   1   30
#> 5   2    1               N             N
#> 6   2    8
#> 7   2   16       P       N       C P/N/C
#> 8   3    1
#> 9   3    8
#> 10  3    6       P       N       C P/N/C
#> 11  4   -6               N             N
#> 12  4    1
#> 13  4    7       P       N           P/N
#> 14  4   15       P       N           P/N
Sample.data$Event <-
interaction(Sample.data$Event_P, Sample.data$Event_N, Sample.data$Event_C, sep = '/') %>%
gsub('^/|^//|/$|//$', '', .) %>%
gsub('//', '/', .)
Sample.data
#>    ID Days Event_P Event_N Event_C Event
#> 1   1   -5                       C     C
#> 2   1    1
#> 3   1   18       P               C   P/C
#> 4   1   30
#> 5   2    1               N             N
#> 6   2    8
#> 7   2   16       P       N       C P/N/C
#> 8   3    1
#> 9   3    8
#> 10  3    6       P       N       C P/N/C
#> 11  4   -6               N             N
#> 12  4    1
#> 13  4    7       P       N           P/N
#> 14  4   15       P       N           P/N
Created on 2020-09-18 by the reprex package (v0.3.0)
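What the pattern inside gsub('^/|^//|/$|//$', '', ...) does:
^/|^//: removes a / or // at the start of the string
/$|//$: removes a / or // at the end of the string
The second gsub('//', '/', ...) then collapses a // left in the middle of the string into a single /.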
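As a possible variation on the trimming step above, here is a minimal base R sketch that pastes the Event_* columns together and then uses the quantifier + instead of listing / and // separately. The helper names ev_cols, raw and trimmed are only illustrative, and the sketch assumes the event columns are the ones whose names start with "Event_".

```r
# Same sample data as in the question
Sample.data <- data.frame(
  ID      = c(1,1,1,1,2,2,2,3,3,3,4,4,4,4),
  Days    = c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15),
  Event_P = c("","","P","","","","P","","","P","","","P","P"),
  Event_N = c("","","","","N","","N","","","N","N","","N","N"),
  Event_C = c("C","","C","","","","C","","","C","","","",""),
  stringsAsFactors = FALSE
)

# Paste every Event_* column together row by row, e.g. "//C", "P//C", "/N/"
ev_cols <- grep("^Event_", names(Sample.data), value = TRUE)
raw <- do.call(paste, c(Sample.data[ev_cols], sep = "/"))

# Drop any run of slashes at either end of the string ...
trimmed <- gsub("^/+|/+$", "", raw)
# ... then squeeze any remaining run of slashes down to a single one
Sample.data$Event <- gsub("/+", "/", trimmed)

Sample.data
```

Because the patterns match runs of any length, this should keep working if more Event_* columns are added later.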

Related

Replace all subsequent column values, after the first instance of a value greater than x

I have a dataframe (df1) with two columns, one (grp) is a grouping variable, the second (num) has some measurements.
For each group I want to:
replace all numbers greater than 3.5 with 4
replace all numbers after the first instance of 4 with 4
I just want to get to step 2, but step 1 seems like a logical starting point, maybe it isn't required though?
Example data
library(dplyr)
df1 <- data.frame(
grp = rep(c("a", "b"), each = 10),
num = c(0,1,2,5,0,1,7,0,2,1,2,2,2,2,5,0,0,0,0,6))
I can get the first part:
df1 %>%
group_by(grp) %>%
mutate(num = ifelse(num > 3.5, 4, num))
For the second part I tried using dplyr::lag and dplyr::case_when but no luck. Here is the desired output:
grp num
1 a 0
2 a 1
3 a 2
4 a 4
5 a 4
6 a 4
7 a 4
8 a 4
9 a 4
10 a 4
11 b 2
12 b 2
13 b 2
14 b 2
15 b 4
16 b 4
17 b 4
18 b 4
19 b 4
20 b 4
Any advice would be much appreciated.
You could use cumany() to find all cases after the first event, i.e. num > 3.5.
library(dplyr)
df1 %>%
group_by(grp) %>%
mutate(num2 = replace(num, cumany(num > 3.5), 4)) %>%
ungroup()
# A tibble: 20 × 3
grp num num2
<chr> <dbl> <dbl>
1 a 0 0
2 a 1 1
3 a 2 2
4 a 5 4
5 a 0 4
6 a 1 4
7 a 7 4
8 a 0 4
9 a 2 4
10 a 1 4
11 b 2 2
12 b 2 2
13 b 2 2
14 b 2 2
15 b 5 4
16 b 0 4
17 b 0 4
18 b 0 4
19 b 0 4
20 b 6 4
You can also replace cumany(num > 3.5) with cumsum(num > 3.5) > 0.
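To illustrate the cumsum() variant without dplyr, here is a minimal base R sketch. It assumes ave() is an acceptable stand-in for group_by() + mutate(); the name seen is only illustrative.

```r
df1 <- data.frame(
  grp = rep(c("a", "b"), each = 10),
  num = c(0,1,2,5,0,1,7,0,2,1,2,2,2,2,5,0,0,0,0,6)
)

# Within each group, count how many values above 3.5 have been seen so far;
# a running count greater than 0 means "at or after the first event"
seen <- ave(as.integer(df1$num > 3.5), df1$grp, FUN = cumsum) > 0

# Replace every flagged position with 4
df1$num2 <- replace(df1$num, seen, 4)

df1
```

The running count plays the same role as cumany(): once it becomes positive inside a group it stays positive for the rest of that group.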

How can I extract information of one group based on the filtrates of another group in dplyr

My data frame looks like this but with thousands of entries
type <- rep(c("A","B","C"),4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- data.frame(time,type,counts)
df
time type counts
1 0 A 0
2 0 B 30
3 0 C 15
4 1 A 30
5 1 B 30
6 1 C 10
7 2 A 31
8 2 B 30
9 2 C 8
10 3 A 30
11 3 B 8
12 3 C 0
At each time point greater than 0, I want to extract all the types that have counts == 30,
and then, for these types, extract their counts at the next time point.
I want my data to look like this
time type counts time_after type_after counts_after
1 A 30 2 A 31
1 B 30 2 B 30
2 B 30 3 B 8
Any help or guidance is appreciated
Not very elegant but should do the job
library(dplyr)
type <- rep(c("A","B","C"),4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- tibble(time,type,counts)
df
#> # A tibble: 12 x 3
#> time type counts
#> <dbl> <chr> <dbl>
#> 1 0 A 0
#> 2 0 B 30
#> 3 0 C 15
#> 4 1 A 30
#> 5 1 B 30
#> 6 1 C 10
#> 7 2 A 31
#> 8 2 B 30
#> 9 2 C 8
#> 10 3 A 30
#> 11 3 B 8
#> 12 3 C 0
thirties <- df %>%
filter(counts == 30 & time != 0) %>%
mutate(time_after = time + 1)
inner_join(thirties, df, by = c("time_after" = "time",
"type" = "type")) %>%
select(time,
type = type,
counts = counts.x,
time_after,
type_after = type,
count_after = counts.y)
#> # A tibble: 3 x 6
#> time type counts time_after type_after count_after
#> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 1 A 30 2 A 31
#> 2 1 B 30 2 B 30
#> 3 2 B 30 3 B 8
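For comparison, the same filter-then-join idea can be sketched in base R with merge(). This is only a rough equivalent under the assumption that time always advances in steps of 1; the names nxt and res are illustrative, and the type_after column is left out because it always equals type.

```r
type <- rep(c("A", "B", "C"), 4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- data.frame(time, type, counts)

# Rows with counts == 30 after time 0, plus the time point to look up
thirties <- df[df$counts == 30 & df$time > 0, ]
thirties$time_after <- thirties$time + 1

# Copy of the data renamed so it can serve as the "next time point" table
nxt <- setNames(df, c("time_after", "type", "counts_after"))

# Inner join: rows without a following time point are dropped
res <- merge(thirties, nxt, by = c("type", "time_after"))
res <- res[order(res$time, res$type),
           c("time", "type", "counts", "time_after", "counts_after")]
rownames(res) <- NULL
res
```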

Transforming wide data to long format with multiple variables

This may have a simple answer, but after a few hours of searching I still cannot find it. Basically I need to turn a wide dataset into a long format dataset, but with multiple variables. My dataset structure looks like this:
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0s = c(21,25,15),
day1s = c(20,30,18),
day2s = c(18,18,17),
day0t = c(2,5,7),
day1t = c(3,6,5),
day2t = c(3,8,7))
df1
id sex day0s day1s day2s day0t day1t day2t
1 M 21 20 18 2 3 3
2 F 25 30 18 5 6 8
3 M 15 18 17 7 5 7
Basically 3 subjects have done a math test (s) and history test (t) every day for 3 days.
I tried to use gather from tidyr to turn it into long form, but I don't know how to assign the s and t variables to the same day. I also coded a new variable day with just day0 = 0, day1 = 1 and day2 = 2.
dfl <- df1 %>%
gather(variable, value, -c(id, sex))
dfl
id sex variable value day
1 M day0s 21 0
1 M day1s 20 1
1 M day2s 18 2
1 M day0t 2 0
1 M day1t 3 1
1 M day2t 3 2
2 F day0s 25 0
2 F day1s 30 1
2 F day2s 18 2
2 F day0t 5 0
2 F day1t 6 1
2 F day2t 8 2
3 M day0s 15 0
3 M day1s 18 1
3 M day2s 17 2
3 M day0t 7 0
3 M day1t 5 1
3 M day2t 7 2
Ideally in the end it should look like this.
id sex day s t
1 M 0 21 2
1 M 1 20 3
1 M 2 18 3
2 F 0 25 5
2 F 1 30 6
2 F 2 18 8
3 M 0 15 7
3 M 1 18 5
3 M 2 17 7
Do you please have any suggestions on how to achieve this?
You can use {tidyr}'s pivot_longer() here.
If your actual variables are named a bit differently, you can adapt the regex to your case. (Note that in R the backslash has to be escaped, hence the double backslash in \\d+ and \\w+.)
In general, the names_pattern argument works by matching each regex group in parentheses to the corresponding entry of the names_to argument, so that here:
(\\d+) -> becomes the variable day. The regex \d+ matches one or more digits.
(\\w+) -> becomes ".value". The regex \w+ matches one or more word characters. Thanks to r2evans for pointing out the ".value" argument, which spares one further reshape. The documentation states that .value "tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output)." In practice, the text matched by that group (here mt and ht) becomes the names of the value columns in the output.
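To see what the two groups capture, here is a small base R sketch that applies the same pattern to the column names with sub(). It only illustrates the regex and is not how pivot_longer() works internally; the names cols, day and value_name are illustrative.

```r
cols <- c("day0mt", "day1mt", "day2mt", "day0ht", "day1ht", "day2ht")
pattern <- "day(\\d+)(\\w+)"

# First group: the digits after "day", which end up in the day column
day <- sub(pattern, "\\1", cols)

# Second group: the rest of the name, which becomes the output column names
value_name <- sub(pattern, "\\2", cols)

data.frame(column = cols, day = day, value_name = value_name)
```

Trying the pattern on the names like this first is a quick way to check a regex before handing it to names_pattern.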
library(dplyr)
library(tidyr)
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0mt = c(21,25,15),
day1mt = c(20,30,18),
day2mt = c(18,18,17),
day0ht = c(2,5,7),
day1ht = c(3,6,5),
day2ht = c(3,8,7))
df1
#> id sex day0mt day1mt day2mt day0ht day1ht day2ht
#> 1 1 M 21 20 18 2 3 3
#> 2 2 F 25 30 18 5 6 8
#> 3 3 M 15 18 17 7 5 7
df1 %>%
pivot_longer(cols = starts_with("day"),
names_pattern = "day(\\d+)(\\w+)",
names_to = c("day", ".value"))
#> # A tibble: 9 x 5
#> id sex day mt ht
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 M 0 21 2
#> 2 1 M 1 20 3
#> 3 1 M 2 18 3
#> 4 2 F 0 25 5
#> 5 2 F 1 30 6
#> 6 2 F 2 18 8
#> 7 3 M 0 15 7
#> 8 3 M 1 18 5
#> 9 3 M 2 17 7
Created on 2021-06-20 by the reprex package (v2.0.0)
Note that in newer versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider: they still work, but are no longer developed.
Here is an option using the latest development version of data.table (1.14.1), which adds some new melt features.
Use data.table::update.dev.pkg() to install the development version.
library(data.table)
# data.table 1.14.1 IN DEVELOPMENT built 2021-06-22 09:38:23 UTC
dcast(
melt(setDT(df1), measure.vars = measure(day, type, pattern="^day(.)(.)")),
... ~ type, value.var = "value")
id sex day s t
1: 1 M 0 21 2
2: 1 M 1 20 3
3: 1 M 2 18 3
4: 2 F 0 25 5
5: 2 F 1 30 6
6: 2 F 2 18 8
7: 3 M 0 15 7
8: 3 M 1 18 5
9: 3 M 2 17 7
Here is a way. It first reshapes to long format, separates the day* column into day number and suffix columns and reshapes back to wide format.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = starts_with("day")) %>%
mutate(day = str_extract(name, "\\d+"),
suffix = str_extract(name, "[^[:digit:]]+$")) %>%
select(-name) %>%
pivot_wider(
id_cols = -c(value, suffix),
names_from = suffix,
values_from = value
)
## A tibble: 9 x 5
# id sex day s t
# <dbl> <chr> <chr> <dbl> <dbl>
#1 1 M 0 21 2
#2 1 M 1 20 3
#3 1 M 2 18 3
#4 2 F 0 25 5
#5 2 F 1 30 6
#6 2 F 2 18 8
#7 3 M 0 15 7
#8 3 M 1 18 5
#9 3 M 2 17 7

Assign a random value - either 0 or 1 - to factors of a list

I have a list with 1000 factors, each ranging from 1 to 1000 and each factor appears 15 times. I want to either assign 0 or 1 to every factor that has the same value. For instance, factor 1 that appears 15 times has to have always the value 0. Any idea on how to do this? Basically, I would like to have two columns, one with the factors, and one with the value (0 or 1) that each factor has.
You could do:
my_binary <- as.numeric(my_factor) %% 2
So, for example:
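# factor() is needed because data.frame() no longer converts strings to factors (R >= 4.0.0)
df <- data.frame(number = 1:20, factor = factor(rep(letters[1:5], 4)))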
df$binary <- as.numeric(df$factor) %% 2
Gives you
df
#> number factor binary
#> 1 1 a 1
#> 2 2 b 0
#> 3 3 c 1
#> 4 4 d 0
#> 5 5 e 1
#> 6 6 a 1
#> 7 7 b 0
#> 8 8 c 1
#> 9 9 d 0
#> 10 10 e 1
#> 11 11 a 1
#> 12 12 b 0
#> 13 13 c 1
#> 14 14 d 0
#> 15 15 e 1
#> 16 16 a 1
#> 17 17 b 0
#> 18 18 c 1
#> 19 19 d 0
#> 20 20 e 1
And if you want arbitrary numbers at a specified probability you would do:
numbers <- c(0, 1)
probs <- c(0.75, 0.25)
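# factor() is needed because data.frame() no longer converts strings to factors (R >= 4.0.0)
df <- data.frame(number = 1:20, factor = factor(rep(letters[1:5], 4)))
# draw one value per level, then look it up by each row's level code
df$binary <- sample(numbers, nlevels(df$factor), replace = TRUE, prob = probs)[as.numeric(df$factor)]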
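Scaled up to the setup described in the question (1000 levels, 15 rows each), the same idea could look like the sketch below. The names f, level_value and out are illustrative, and the seed is only there to make the draw reproducible.

```r
set.seed(42)

# 1000 levels, each appearing 15 times
f <- factor(rep(1:1000, each = 15))

# One random 0/1 draw per level ...
level_value <- sample(c(0, 1), nlevels(f), replace = TRUE)

# ... looked up for every row through the level's integer code
out <- data.frame(factor = f, value = level_value[as.integer(f)])

# Check: every level maps to exactly one value
all(tapply(out$value, out$factor, function(v) length(unique(v))) == 1)
```

Drawing once per level and indexing afterwards is what guarantees that all 15 rows of a level share the same value.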

Randomizing consecutive n rows of a dataframe in r

The problem I have is as explained in the title. I want to randomize the top, middle and bottom 3 rows in place. Here is a sample dataframe.
> set.seed(7)
> mydf
Id Name Score Feedback
1 1 AB 11 P
2 2 AA 12 P
3 3 AC 12 P
4 4 AD 31 P
5 5 AE 13 P
6 6 AF 15 P
7 7 AG 9 F
8 8 AH 8 F
9 9 AI 11 P
I could take the top, middle and last 3 rows independently and do a randomization and merge them back as follows:
# Take consecutive blocks of 3 rows from mydf
top3 <- head(mydf,3)
middle3 <- mydf[4:6,]
tail3 <- tail(mydf,3)
# randomize the rows
top3r <- top3[sample(nrow(top3)),]
middle3r <- middle3[sample(nrow(middle3)),]
tail3r <- tail3[sample(nrow(tail3)),]
# merge them back
mydfr <- rbind(top3r, middle3r, tail3r)
> mydfr
Id Name Score Feedback
2 2 AA 12 P
1 1 AB 11 P
3 3 AC 12 P
6 6 AF 15 P
4 4 AD 31 P
5 5 AE 13 P
7 7 AG 9 F
8 8 AH 8 F
9 9 AI 11 P
Is there some way I could achieve the same without going through the manual process of pulling the n rows?
Thank you,
This is basically the same as your code, but without all the intermediate variables.
mydf[c(sample(1:3), sample(4:6), sample(7:9)), ]
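If the block boundaries should not be typed by hand, the one-liner above can be generalized along these lines. This is a sketch that rebuilds mydf from the values printed in the question; n, block and idx are illustrative names.

```r
mydf <- data.frame(
  Id = 1:9,
  Name = c("AB", "AA", "AC", "AD", "AE", "AF", "AG", "AH", "AI"),
  Score = c(11, 12, 12, 31, 13, 15, 9, 8, 11),
  Feedback = c("P", "P", "P", "P", "P", "P", "F", "F", "P")
)

set.seed(7)
n <- 3

# Block id for every row: 1,1,1,2,2,2,3,3,3
block <- ceiling(seq_len(nrow(mydf)) / n)

# Shuffle the row indices inside each block; sample.int() on the block length
# avoids the sample(x) surprise when a block holds a single row
idx <- unlist(
  lapply(split(seq_len(nrow(mydf)), block),
         function(i) i[sample.int(length(i))]),
  use.names = FALSE
)

mydfr <- mydf[idx, ]
mydfr
```

A last block shorter than n (when the number of rows is not a multiple of n) is simply shuffled within itself.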
Here is a way it could be done if you wanted to use dplyr (I do like the base solution by @Gregor in the comments though).
library(dplyr)
set.seed(1)
mydf %>%
mutate(grp = rep(1:3, each = 3)) %>%
group_by(grp) %>%
sample_n(3)
#> # A tibble: 9 x 5
#> # Groups: grp [3]
#> Id Name Score Feedback grp
#> <int> <chr> <int> <chr> <int>
#> 1 1 AB 11 P 1
#> 2 3 AC 12 P 1
#> 3 2 AA 12 P 1
#> 4 6 AF 15 P 2
#> 5 4 AD 31 P 2
#> 6 5 AE 13 P 2
#> 7 9 AI 11 P 3
#> 8 8 AH 8 F 3
#> 9 7 AG 9 F 3

Resources