Assign consecutive trial numbers to data frame beginning at match - r

I want to assign consecutive trial numbers (1-16) to a long dataframe depending on certain values of other variables.
It should look like this (simplified):
value trial_no
videoA 1
other 1
videoB 2
other 2
other 2
videoC 3
...
This basically does what I want, but it just assigns the row numbers.
df2 <- df1 %>%
  mutate(trial_no = case_when(grepl('video', value) ~ row_number())) %>%
  fill(trial_no)
This might do what I want, but it assigns 16 to all of them.
for (vid in c(1:16)) {
  df2 <- df1 %>%
    mutate(trial_no = case_when(grepl("video", value) ~ vid)) %>%
    fill(trial_no)
}
I'm pretty sure there is an easy solution to this.
Any help is very much appreciated.

Use grepl and count the TRUEs with cumsum:
transform(dat, trial_no=cumsum(grepl('video', value)))
# value trial_no
# 1 videoA 1
# 2 other 1
# 3 videoB 2
# 4 other 2
# 5 other 2
# 6 videoC 3
Data:
dat <- structure(list(value = c("videoA", "other", "videoB", "other",
"other", "videoC")), class = "data.frame", row.names = c(NA,
-6L))
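If you prefer to stay in the dplyr pipeline from the question, the same cumsum(grepl()) idea drops straight into mutate (a minimal sketch, assuming the same dat as above):
library(dplyr)
dat %>%
  mutate(trial_no = cumsum(grepl("video", value)))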

Related

Counting all elements, row wise in list column

My dataframe looks like this:
V1
c("cheese","bread","sugar","cream","milk","butter")
c("milk","butter","apples","cream","bread")
c("butter","milk","toffee")
c("cream","milk","butter","sugar")
I am trying to count the number of times each element appears and put the count in a new column. I would like to end up with something like this:
V2 V3
cheese 1
bread 2
sugar 2
cream 3
milk 4
butter 4
apples 1
toffee 1
I have tried using the following code
counts <- unlist(V1, use.names = FALSE)
counts <- table(counts)
But for some reason the counts are wrong and values are being skipped.
If I understand you correctly and your data is organized as provided below, then we could do it this way:
Using separate_rows splits every string element into its own row.
Then remove the literal "c" and the empty rows.
Use fct_inorder from the forcats package (it is in the tidyverse) to keep the order as provided.
Then apply count with the name argument:
library(tidyverse)
df %>%
  separate_rows(V1) %>%
  filter(!(V1 == "c" | V1 == "")) %>%
  mutate(V1 = fct_inorder(V1)) %>%
  count(V1, name = "V3")
V1 V3
<fct> <int>
1 cheese 1
2 bread 2
3 sugar 2
4 cream 3
5 milk 4
6 butter 4
7 apples 1
8 toffee 1
df <- structure(list(V1 = c("c(\"cheese\",\"bread\",\"sugar\",\"cream\",\"milk\",\"butter\")",
"c(\"milk\",\"butter\",\"apples\",\"cream\",\"bread\")", "c(\"butter\",\"milk\",\"toffee\")",
"c(\"cream\",\"milk\",\"butter\",\"sugar\")")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
There are a couple of little issues with the question. I found it hard to reproduce it exactly, so I took some liberties with the DF and present a couple of options that might help:
Option 1 - data in one column
library(tidyverse)
df <- data.frame(V1 = c("cheese","bread","sugar","cream","milk","butter",
                        "milk","butter","apples","cream","bread",
                        "butter","milk","toffee",
                        "cream","milk","butter","sugar"))
df <- df %>%
  dplyr::group_by(V1) %>%
  summarise(V3 = n())
Option 2 - data in separate columns (NAs added so it forms a data frame)
library(tidyverse)
df <- data.frame(c("cheese","bread","sugar","cream","milk","butter"),
                 c("milk","butter","apples","cream","bread",NA),
                 c("butter","milk","toffee",NA,NA,NA),
                 c("cream","milk","butter","sugar",NA,NA))
df <- data.frame(V1 = unlist(df)) %>%
  select(V1) %>%
  drop_na() %>%
  group_by(V1) %>%
  summarise(V3 = n())
hope this helps!

Is there a way to group_by values in R?

I have a dataframe that looks something like this
Column A         Column B
Accepted         Did not accept
Did not accept   Did not accept
and so on..
Just wanted to know if there is a way to manipulate the data into a table visualisation that counts the number of Accepted and Did not accept for each column, something like this:
Accepted or not?   Column A   Column B
Accepted           10         2
Did not accept     5          5
In base R, use sapply with table -
sapply(df,function(x) table(factor(x, levels = c('Accepted', 'Did not accept'))))
# Column.A Column.B
#Accepted 1 0
#Did not accept 1 2
In tidyverse -
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  pivot_wider(names_from = name, values_from = name,
              values_fn = length, values_fill = 0)
# value Column.A Column.B
# <chr> <int> <int>
#1 Accepted 1 0
#2 Did not accept 1 2
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Column.A = c("Accepted", "Did not accept"), Column.B = c("Did not accept",
"Did not accept")), row.names = c(NA, -2L), class = "data.frame")
Convert wide-to-long, then tabulate:
table(stack(df))
# ind
# values Column.A Column.B
# Accepted 1 0
# Did not accept 1 2
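If you want that cross-tabulation as a regular data frame with the Accepted / Did not accept labels as their own column, a small sketch on the same df (the column name status is just illustrative):
out <- as.data.frame.matrix(table(stack(df)))  # table object -> data frame
out <- cbind(status = rownames(out), out)      # hypothetical label column
out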

appending values from a look-up table to columns of another data frame based on trailing zero patterns

Data frame dat includes a set of numeric ids in a vector called code_num. Some of these ids end with one or more zeros. Others do not. Here are the first three lines:
code_num X1 X2 X3 … X50
251000 NA NA NA NA
112020 NA NA NA NA
537199 NA NA NA NA
The full data of dat are in the first tab of this google sheet.
Another data frame lut includes another set of numeric ids called code_num_moredetail that need to be associated with the higher-level identifiers in dat. Here are seven example observations of lut:
code_num_moredetail
251000.99
251743.00
251222.02
112020.01
112029.01
537119.00
537119.99
The full data of lut are in the second tab of this google sheet.
The trailing zeros in dat$code_num are wild card digits. Any value of lut$code_num_moredetail that matches the digits preceding the trailing zeros of dat$code_num should be considered a matching value, and needs to be added to the ith value of dat$X1 through dat$X50 (or beyond - I'm not certain how many matches to expect).
Consider two example cases:
if dat$code_num = 999000, then every value of lut$code_num_moredetail that matched the pattern 999###.## would need to be inserted into the columns that begin with the letter X in dat.
if dat$code_num = 999090 then every value of lut$code_num_moredetail that matched the pattern 99909#.## would need to be inserted into the columns that begin with the letter X in dat.
Using only the values provided in the example data frames, the final solution would make dat look like this:
code_num X1 X2 X3
251000 251000.99 251743.00 251222.02
112020 112020.01 112029.01 NA
537199 537119.00 537119.99 NA
I'm seeking an efficient way to augment dat with all wild-card-matched values of lut.
Note: some values of dat$code_num may not match any value of lut$code_num_moredetail - a proper solution must accommodate i matches, where i can range from 0 to 50.
Try
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
out <- lut %>%
  mutate(new = substr(code_num_moredetail, 1, 3)) %>%
  left_join(dat %>%
              transmute(code_num, new = substr(code_num, 1, 3))) %>%
  mutate(rn = str_c("X", rowid(new))) %>%
  pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
  select(-new)
-output
out
# A tibble: 3 x 4
code_num X1 X2 X3
<int> <dbl> <dbl> <dbl>
1 251000 251001. 251743 251222.
2 112020 112020. 112029. NA
3 537199 537119 537120. NA
The digits are in the data; it is just the tibble print that shortens the display:
print(out$X3, digits = 10)
[1] 251222.02 NA NA
Or maybe, with fuzzyjoin:
library(fuzzyjoin)
dat1 <- dat %>%
  transmute(code_num, new = sub("0+$", "", code_num))
lut$new <- str_replace(sub("\\..*", "", sprintf('%.2f', lut[[1]])),
                       paste0(".*(", paste(dat1$new, collapse = "|"), ").*"), "\\1")
stringdist_left_join(lut, dat1) %>%
  select(code_num_moredetail, code_num, new = new.x) %>%
  mutate(rn = str_c("X", rowid(new))) %>%
  pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
  select(-new)
-output
# A tibble: 3 x 4
code_num X1 X2 X3
<int> <dbl> <dbl> <dbl>
1 251000 251001. 251743 251222.
2 112020 112020. 112029. NA
3 537199 537119 537120. NA
data
lut <- structure(list(code_num_moredetail = c(251000.99, 251743, 251222.02,
112020.01, 112029.01, 537119, 537119.99)), row.names = c(NA,
-7L), class = "data.frame")
dat <- structure(list(code_num = c(251000L, 112020L, 537199L),
X1 = c(NA,
NA, NA), X2 = c(NA, NA, NA), X3 = c(NA, NA, NA)), class = "data.frame",
row.names = c(NA,
-3L))
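For reference, a minimal base R sketch of the trailing-zero wildcard rule itself, assuming the dat and lut objects above: strip the trailing zeros from each code_num and keep every lut code whose digits start with that prefix.
# prefix of each code_num with the wild-card (trailing zero) digits removed
pref <- sub("0+$", "", as.character(dat$code_num))
# for each prefix, collect the lut codes whose formatted value starts with it
matches <- lapply(pref, function(p)
  lut$code_num_moredetail[startsWith(sprintf("%.2f", lut$code_num_moredetail), p)])
# pad with NA so every row gets the same number of X columns
n <- max(lengths(matches))
X <- do.call(rbind, lapply(matches, function(m) c(m, rep(NA, n - length(m)))))
cbind(dat["code_num"], setNames(as.data.frame(X), paste0("X", seq_len(n))))
Note that 537199 has no trailing zeros, so under the stated rule it matches nothing; the first option above sidesteps that by joining on the first three digits instead.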

Gather twice in same data frame

I have a dataframe where I want to do two separate gathers
library(tidyverse)
id <- c("A","B","C","D","E")
test_1_baseline <- c(1,2,4,5,6)
test_2_baseline <- c(21000, 23400, 26800,29000,30000)
test_1_followup <- c(0,4,2,3,1)
test_2_followup <- c(10000,12000,13000,15000,21000)
layout_1 <-data.frame(id,test_1_baseline,test_1_followup,test_2_baseline,test_2_followup)
This is the current layout.
Each person is 1 line.
The result of Test 1 at baseline is one variable
The result of Test 2 at baseline is a second variable
The same applies to Test 1/2 follow-up results
I would like the data to be tidier: one column for the timepoint, one for the result of test 1, one for the result of test 2.
id2 <- c("A","B","C","D","E","A","B","C","D","E")
time <- c(rep("baseline",5),rep("followup",5))
test_1_result <- c(1,2,4,5,6,0,4,2,3,1)
test_2_result <- c(21000, 23400, 26800,29000,30000,10000,12000,13000,15000,21000)
layout_2 <- data.frame(id2, time,test_1_result,test_2_result)
I'm currently doing what seems to me an odd process where first of all I gather the test 1 data:
test_1 <- select(layout_1, id, test_1_baseline, test_1_followup) %>%
  gather("Timepoint", "test_1", c(test_1_baseline, test_1_followup)) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_1_baseline", "baseline")) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_1_followup", "followup"))
Then I do the same for test 2 and join them:
test_2 <- select(layout_1, id, test_2_baseline, test_2_followup) %>%
  gather("Timepoint", "test_2", c(test_2_baseline, test_2_followup)) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_2_baseline", "baseline")) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_2_followup", "followup"))
test_combined <- full_join(test_1,test_2)
I tried doing the first gather and then the second on the same dataframe, but then you end up with duplicates, i.e. you end up with:
ID 1 Test_1 Baseline Test_2 Baseline
ID 1 Test_1 Baseline Test_2 Followup
ID 1 Test_1 Followup Test_2 Baseline
ID 1 Test_1 Followup Test_2 Followup
== 4 rows where there should only be 2
I feel there must be a cleaner tidyverse way to do this.
Guidance welcomed
One option is with data.table using melt, which can take multiple measure patterns:
library(data.table)
nm1 <- unique(sub(".*_", "", names(layout_1)[-1]))
melt(setDT(layout_1), measure = patterns("test_1", "test_2"),
     value.name = c('test_1_result', 'test_2_result'),
     variable.name = 'time')[, time := nm1[time]][]
You could gather all columns except id, then use separate to split into result and time.
Note that this code assumes that the result name is always 6 characters (test_1, test_2), and separates based on that assumption. You'll need to devise a different separate if that is not the case.
library(tidyr)
library(dplyr)
layout_1 %>%
  gather(Var, Val, -id) %>%
  separate(Var, into = c("result", "time"), sep = 6) %>%
  spread(result, Val) %>%
  mutate(time = gsub("_", "", time))
Result:
id time test_1 test_2
1 A baseline 1 21000
2 A followup 0 10000
3 B baseline 2 23400
4 B followup 4 12000
5 C baseline 4 26800
6 C followup 2 13000
7 D baseline 5 29000
8 D followup 3 15000
9 E baseline 6 30000
10 E followup 1 21000
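For what it's worth, a tidyr pivot_longer sketch (the successor to gather) can do both reshapes in one call, assuming layout_1 as defined in the question; the ".value" sentinel sends test_1 and test_2 back out as separate columns:
library(tidyr)
# reshape both tests at once: the time part of the name becomes a column,
# the test_1/test_2 part becomes the value column names
pivot_longer(layout_1, cols = -id,
             names_to = c(".value", "time"),
             names_pattern = "(test_\\d)_(.*)")
Rename test_1/test_2 afterwards if you want the exact test_1_result/test_2_result names from layout_2.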

Separating Column Based on First Value of String

I have an ID variable that I am trying to separate into two separate columns based on their prefix being either a 1 or 2.
An example of my data is:
STR_ID
1434233
2343535
1243435
1434355
I have tried countless ways to try to separate these variables into columns based on their prefixes, but cannot seem to figure it out. Any ideas on how I would do this? Thank you in advance.
We create a grouping variable with substr by extracting the first character/digit of 'STR_ID', and spread it to 'wide' format
library(tidyverse)
df1 %>%
  group_by(grp = paste0('grp', substr(STR_ID, 1, 1))) %>%
  mutate(i = row_number()) %>%
  spread(grp, STR_ID) %>%
  select(-i)
# A tibble: 3 x 2
# grp1 grp2
# <int> <int>
#1 1434233 2343535
#2 1243435 NA
#3 1434355 NA
data
df1 <- structure(list(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L
)), class = "data.frame", row.names = c(NA, -4L))
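A quick base R alternative, as a sketch on the same df1: split the IDs by their leading digit and pad the shorter group with NA so the pieces can sit side by side as columns.
# split the IDs on their first digit, naming the groups grp1/grp2
grp <- split(df1$STR_ID, paste0("grp", substr(df1$STR_ID, 1, 1)))
# pad with NA to a common length, then bind as columns
n <- max(lengths(grp))
as.data.frame(lapply(grp, function(x) c(x, rep(NA, n - length(x)))))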
