Counting where the string appears in row with caveat - [R]

Counting where the string appears in row with caveat - [R] - r

I have a dataset where the first three columns (G1P1, G1P2, G1P3) indicate one grouping of three individuals (i.e. Sidney, Blake, Max on Row 1), the second three columns (G2P1, G2P2, G2P3) indicate another grouping of three individuals (i.e. David, Steve, Daniel on Row 2), etc.... There are a total of 12 individuals, and dataset is pretty much all the possible groupings of these 12 people (approximately 300,000 rows). Each group's cumulative test scores are represented on far right columns, (G1.Sum, G2.Sum, G3.Sum, G4.Sum
).
#### The dput(data) of the first five rows ####
data <- structure(list(X = 1:5, G1P1 = structure(c(4L, 4L, 4L, 4L, 4L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G1P2 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G1P3 = structure(c(4L, 4L, 4L, 4L, 4L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G2P1 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G2P2 = structure(c(4L, 4L, 3L, 3L, 2L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G2P3 = structure(c(3L, 3L, 2L, 2L, 1L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G3P1 = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G3P2 = structure(c(3L, 2L, 4L, 2L, 4L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G3P3 = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G4P1 = structure(c(3L, 1L, 3L, 2L, 1L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G4P2 = structure(c(2L, 3L, 2L, 4L, 3L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G4P3 = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G1.Sum = c(63.33333333, 63.33333333, 63.33333333, 63.33333333, 63.33333333), G2.Sum = c(58.78333333, 58.78333333, 54.62333333, 54.62333333, 58.69), G3.Sum = c(54.62333333, 58.69, 58.78333333, 58.69, 58.78333333), G4.Sum = c(58.69, 54.62333333, 58.69, 58.78333333, 54.62333333)), .Names = c("X", "G1P1", "G1P2", "G1P3", "G2P1", "G2P2", "G2P3", "G3P1", "G3P2", "G3P3", "G4P1", "G4P2", "G4P3", "G1.Sum", "G2.Sum", "G3.Sum", "G4.Sum"), row.names = c(NA, 5L), class = "data.frame")
I was wondering how you would write an R function so for each row, you can record where the person's group score ranked. For example, on Row 1, SIDNEY was in a group with the highest score at 63.3333. So his rank would be a '1.' But for BRANDON, his group scored last (54.62333), so her rank would be 4. I would like my final data.frame output to be something like this:

ranks <- t(apply(data[grep("Sum", names(data))], 1, function(x) rep(match(x, sort(x, decreasing=T)),each=3)))
just.names <- data[grep("P", names(data))] #Subset without sums
names <- as.character(unlist(just.names[1,])) #create name vector
sapply(names, function(x) ranks[just.names == x])
# SIDNEY BLAKE MAX DAVID STEVE DANIEL CHRIS PATRIC BRANDON EVA NICK BEAU
# [1,] 1 1 1 2 2 2 4 4 4 3 3 3
# [2,] 1 1 1 2 2 2 4 4 4 3 3 3
# [3,] 1 1 1 2 2 2 4 4 4 3 3 3
# [4,] 1 1 1 2 2 2 4 4 4 3 3 3
# [5,] 1 1 1 2 2 2 4 4 4 3 3 3
We first rank the sums and replicate them 3 times each. Next we subset the larger data frame with the names only (take out the sum columns). We make a vector with the individual names. And lastly, we subset the ranks matrix that we created first by seeing where in the data frame the name appears.

Using dplyr and tidyr. First, ranking, then uniting all the rows with their rank, converting to long data, separating out the variables, then finally spreading.
It got really long, and can probably be simplified:
library(dplyr)
library(tidyr)
data[ ,14:17] <- t(apply(-data[ ,14:17], 1 , rank))
data %>% unite("g1", starts_with("G1")) %>%
unite("g2", starts_with("G2")) %>%
unite("g3", starts_with("G3")) %>%
unite("g4", starts_with("G4")) %>%
gather(Row, val, -X) %>%
select(-Row) %>%
separate(val, c("1", "2", "3", "rank")) %>%
gather(zzz, name, -X, -rank) %>%
select(-zzz) %>%
spread(name, rank)
X BEAU BLAKE BRANDON CHRIS DANIEL DAVID EVA MAX NICK PATRIC SIDNEY STEVE
1 1 3 1 4 4 2 2 3 1 3 4 1 2
2 2 3 1 4 4 2 2 3 1 3 4 1 2
3 3 3 1 4 4 2 2 3 1 3 4 1 2
4 4 3 1 4 4 2 2 3 1 3 4 1 2
5 5 3 1 4 4 2 2 3 1 3 4 1 2

Using previous answer's 'rank' matrix and library(reshape2) to convert wide data.frame to long data.frame,
ranks <- t(apply(test[grep("Sum",names(test))], 1, function (x)
rep(match(x, sort(x, decreasing=T)),each=3)))
colnames(ranks) <- names(test)[grep("P", names(test))]
# data subset
test_L <- test[,-grep("Avg", names(test))]
df_player <- data.frame(position= names(test)[grep("P", names(test))],
t(test_L[,-1]), row.names = NULL)
df_ranks <- data.frame(position=names(test)[grep("P", names(test))],
t(ranks), row.names=NULL)
# Combine two temporary data.frames
df_player_melted <- melt(df_player, id=1,
variable.name = "rowNumber", value.name = "player")
df_ranks_melted <- rank= melt(df_ranks, id=1,
variable.name = "rowNumber", value.name = "rank")
df <- cbind(df_player_melted, rank= df_ranks_melted$rank)
# cast into the output format you want
df <- dcast(df, rowNumber ~ player + rank)[1,]

Related

Degrouping from one row per group to one row per subject

I have data where each row represents a household, and I would like to have one row per individual in the different households.
The data looks similar to this:
df <- data.frame(village = rep("aaa",5),household_ID = c(1,2,3,4,5),name_1 = c("Aldo","Giovanni","Giacomo","Pippo","Pippa"),outcome_1 = c("yes","no","yes","no","no"),name_2 = c("John","Mary","Cindy","Eva","Doron"),outcome_2 = c("yes","no","no","no","no"))
I would still like to keep the wide format of the data, just with one individual (and related outcome variables) per row. I could find examples that tell how to do the opposite, going from individual to grouped data using dcast, but I could not find examples of this problem I am facing now.
I have tried with melt
reshape2::melt(df, id.vars = "household_ID")
but I get a long format data.
Any suggestions welcome...
Thank you

Use pivot_longer() in tidyr, and set ".value" in names_to to indicate new column names from the pattern of the original column names.
library(tidyr)
df %>%
pivot_longer(-c(village, household_ID),
names_to = c(".value", "n"),
names_sep = "_")
# # A tibble: 10 x 5
# village household_ID n name outcome
# <fct> <dbl> <chr> <fct> <fct>
# 1 aaa 1 1 Aldo yes
# 2 aaa 1 2 John yes
# 3 aaa 2 1 Giovanni no
# 4 aaa 2 2 Mary no
# 5 aaa 3 1 Giacomo yes
# 6 aaa 3 2 Cindy no
# 7 aaa 4 1 Pippo no
# 8 aaa 4 2 Eva no
# 9 aaa 5 1 Pippa no
# 10 aaa 5 2 Doron no
Data
df <- structure(list(village = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "aaa", class = "factor"),
household_ID = c(1, 2, 3, 4, 5), name_1 = structure(c(1L,
3L, 2L, 5L, 4L), .Label = c("Aldo", "Giacomo", "Giovanni",
"Pippa", "Pippo"), class = "factor"), outcome_1 = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
name_2 = structure(c(4L, 5L, 1L, 3L, 2L), .Label = c("Cindy",
"Doron", "Eva", "John", "Mary"), class = "factor"), outcome_2 = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))

subsetting data based with the condition of the current and previous entity in r

I have data with the status column. I want to subset my data to the condition of 'f' status, and previous condition of 'f' status.
to simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?

Here's a solution using dplyr -
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))

R - dplyr map slice for repeat rows

I have trouble combining slice and map.
I am interested of doing something similar to this; which is, in my case, transforming a compact person-period file to a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How I am suppose to do so ?
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
Is not working for example.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")

map takes two arguments: a vector/list in .x and a function in .f. It then applies .f on all elements in .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.

I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.

R - how to avoid repeating filter & row bind

Because I am working on a very large dataset, I need to slice my dataset by groups in order to pursue my computations.
I have a person-period (melt) dataset that looks like this
group id var time
1 A 1 a 1
2 A 1 b 2
3 A 1 a 3
4 A 2 b 1
5 A 2 b 2
6 A 2 b 3
7 B 1 a 1
8 B 1 a 2
9 B 1 a 3
10 B 2 c 1
11 B 2 c 2
12 B 2 c 3
I need to do this simple transformation
library(reshape2)
library(dplyr)
dt %>% dcast(group + id ~ time, value.var = 'var')
In order to get
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
So far, so good.
However, because my database is too big, I need to do this separately for each different groups, such as
a = dt %>% filter(group == 'A') %>% dcast(group + id ~ time, value.var ='var')
b = dt %>% filter(group == 'B') %>% dcast(group + id ~ time, value.var = 'var')
bind_rows(a,b)
My problem is that I would like to avoid doing it by hand. I mean, having to store separately each groups, a = ..., b = ..., c = ..., and so on
Any idea how I could have a single pipe stream that would separate each group, compute the transformation and put it back together in a dataframe ?
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), time = structure(c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor")), .Names = c("group", "id",
"var", "time"), row.names = c(NA, -12L), class = "data.frame")

Package purrr can be useful for working with lists. First split the dataset by group and then use map_df to dcast each list but return everything in a single data.frame.
library(purrr)
dt %>%
split(.$group) %>%
map_df(~dcast(.x, group + id ~ time, value.var = "var"))
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c

lapply is your friend here:
do.call(rbind, lapply(unique(dt$Group), function(grp, dt){
dt %>% filter(Group == grp) %>% dcast(group + id ~ time, value.var = "var")
}, dt = dt))

'Complex' aggregation function in dcast from reshape2

I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!

I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
value.var = "variable", function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790

An other solution without additional packages.
aggregate(list(Summary=long$variable),by=list(Day=long$Day,Genotype=long$Genotype),function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Counting where the string appears in row with caveat - [R] - r

Related

Degrouping from one row per group to one row per subject

subsetting data based with the condition of the current and previous entity in r

R - dplyr map slice for repeat rows

R - how to avoid repeating filter & row bind

'Complex' aggregation function in dcast from reshape2

Categories

Resources