Split dataset based on column with a loop

I have been trying to write a loop that splits a dataset into multiple datasets based on a column value. However, the dataset is in a format I haven't handled before (i.e. a list containing both lists and data.tables). The dataset is reproducible by:
table1 <- data.table::data.table(
  Scenario = rep(c("A", "B", "C", "D"), 4),
  A = c(rep("x", 4), rep("b", 4), rep("s", 4), rep("u", 4)),
  Correlation = c(1, 0.125, 0.1, 0,
                  0.125, 1, 0.2, 0,
                  0.1, 0.2, 1, 0,
                  0, 0, 0, 1),
  Matrix = "IM",
  stringsAsFactors = FALSE,
  check.names = FALSE)
table2 <- data.table::data.table(
  Scenario = rep(c("A", "B", "C", "D"), 4),
  A = c(rep("x", 4), rep("b", 4), rep("s", 4), rep("u", 4)),
  Correlation = c(1, 0.125, 0.1, 0,
                  0.125, 1, 0.2, 0,
                  0.1, 0.2, 1, 0,
                  0, 0, 0, 1),
  Matrix = "IM",
  stringsAsFactors = FALSE,
  check.names = FALSE)
table3 <- data.table::data.table(
  Scenario = rep(c("A", "B", "C", "D"), 4),
  A = c(rep("x", 4), rep("b", 4), rep("s", 4), rep("u", 4)),
  Correlation = c(1, 0.125, 0.1, 0,
                  0.125, 1, 0.2, 0,
                  0.1, 0.2, 1, 0,
                  0, 0, 0, 1),
  Matrix = "IM",
  stringsAsFactors = FALSE,
  check.names = FALSE)
list1 <- list("a" = "2019", "b" = "2020", "c" = "2021")
list2 <- list("a" = "test", "b" = "test", "c" = "test")
input_data <- list("table1" = table1, "table2" = table2, "table3" = table3,
                   "list1" = list1, "list2" = list2)
I need a loop that splits this dataset based on all unique values in the Scenario column. The first dataset (for Scenario value "A") is reproducible by:
table1 <- data.table::data.table(
  Scenario = rep("A", 4),
  A = c("x", "b", "s", "u"),
  Correlation = c(1, 0.125, 0.1, 0),
  Matrix = "IM",
  stringsAsFactors = FALSE,
  check.names = FALSE)
table2 <- data.table::data.table(
  Scenario = rep("A", 4),
  A = c("x", "b", "s", "u"),
  Correlation = c(1, 0.125, 0.1, 0),
  Matrix = "IM",
  stringsAsFactors = FALSE,
  check.names = FALSE)
table3 <- data.table::data.table(
  Scenario = rep("A", 4),
  A = c("x", "b", "s", "u"),
  Correlation = c(1, 0.125, 0.1, 0),
  Matrix = "IM",
  stringsAsFactors = FALSE,
  check.names = FALSE)
list1 <- list("a" = "2019", "b" = "2020", "c" = "2021")
list2 <- list("a" = "test", "b" = "test", "c" = "test")
input_data <- list("table1" = table1, "table2" = table2, "table3" = table3,
                   "list1" = list1, "list2" = list2)
Please let me know if additional information is needed.

You can write a function that wraps lapply, utilizing inherits as a check for the type of each object in the list. If the object inherits from data.frame and contains a column called Scenario then you can simply subset it. Items that are not data frames or data tables, or those that do not have columns called Scenario are left unaltered:
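get_scenario <- function(S) {
  # Note: input_data is taken from the calling environment
  lapply(input_data, function(x) {
    # Leave non-tabular items (e.g. plain lists) untouched
    if (!inherits(x, "data.frame"))
      return(x)
    # Leave tables without a Scenario column untouched
    else if (!"Scenario" %in% names(x))
      return(x)
    # Keep only the rows for the requested scenario
    return(x[x$Scenario == S, ])
  })
}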
This allows:
get_scenario("A")
#> $table1
#> Scenario A Correlation Matrix
#> 1: A x 1.000 IM
#> 2: A b 0.125 IM
#> 3: A s 0.100 IM
#> 4: A u 0.000 IM
#>
#> $table2
#> Scenario A Correlation Matrix
#> 1: A x 1.000 IM
#> 2: A b 0.125 IM
#> 3: A s 0.100 IM
#> 4: A u 0.000 IM
#>
#> $table3
#> Scenario A Correlation Matrix
#> 1: A x 1.000 IM
#> 2: A b 0.125 IM
#> 3: A s 0.100 IM
#> 4: A u 0.000 IM
#>
#> $list1
#> $list1$a
#> [1] "2019"
#>
#> $list1$b
#> [1] "2020"
#>
#> $list1$c
#> [1] "2021"
#>
#>
#> $list2
#> $list2$a
#> [1] "test"
#>
#> $list2$b
#> [1] "test"
#>
#> $list2$c
#> [1] "test"
And if you want all subgroups as one uber-list, you can do:
lapply(c("A", "B", "C", "D"), get_scenario)
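
If the scenario values should not be hard-coded, they can be collected from the tables themselves. The following is only a minimal sketch of that idea and is not part of the answer above: split_by_scenario and its arguments are hypothetical names, the list is passed in as an argument instead of being read from the global environment, and the demo uses plain data frames (assumed here so that it runs without data.table).

```r
# Hypothetical helper (not from the answer above): split every table in a
# mixed list by the unique values of one column, keeping other items as-is.
split_by_scenario <- function(data, column = "Scenario") {
  # TRUE for list items that are tables containing the splitting column
  has_col <- vapply(
    data,
    function(x) inherits(x, "data.frame") && column %in% names(x),
    logical(1)
  )
  # All unique values of the column across the qualifying tables
  values <- sort(unique(unlist(
    lapply(data[has_col], function(x) as.character(x[[column]]))
  )))
  # One filtered copy of the whole list per value
  out <- lapply(values, function(v) {
    Map(function(x, keep) if (keep) x[x[[column]] == v, ] else x,
        data, has_col)
  })
  setNames(out, values)
}

# Small demo list: one table and one plain list
demo <- list(
  table1 = data.frame(Scenario = c("A", "B", "A"),
                      Correlation = c(1, 0.125, 0.1)),
  list1  = list(a = "2019", b = "2020")
)
by_scenario <- split_by_scenario(demo)
names(by_scenario)          # "A" "B"
nrow(by_scenario$A$table1)  # 2
```

The result is a named list, so a single scenario can be pulled out with by_scenario$A or by_scenario[["A"]].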

Related

Iteration over 2 lists in R

I would like to do an iteration with 2 lists.
For a single case, I have one dataframe df1 and one vector v1.
My reproducible example as below.
df1 <- data.frame(n1 = c(2,2,0),
n2 = c(2,1,1),
n3 = c(0,1,1),
n4 = c(0,1,1))
v1 <- c(1,2,3)
Now, I calculate a value (ses.value) for each row using this code:
x <- (v1 - apply(df1, 1, mean))/apply(df1,1,sd)
Let's say we have a list of multiple dataframes, l1, and a list of vectors, l2 (each list has the same number of elements). Now, I would like to run a loop over those lists using the above code (each element of l1 must be paired with the element of l2 at the same position).
# 3 dataframes and 3 vectors
df1 <- data.frame(n1 = c(2,2,0), n2 = c(2,1,1), n3 = c(0,1,1), n4 = c(0,1,1))
df2 <- data.frame(n1 = c(1,6,0), n2 = c(2,1,8), n3 = c(0,2,1), n4 = c(0,7,1))
df3 <- data.frame(n1 = c(1,6,0), n2 = c(9,1,5), n3 = c(4,2,1), n4 = c(0,7,2))
v1 <- c(1,2,3)
v2 <- c(2,3,4)
v3 <- c(4,5,6)
# list
l1 <- list(df1,df2,df3)
l2 <- list(v1,v2,v3)
Since my lists are very big, using a for loop might not be such a good idea. Any suggestions using lapply or something similar?

We can use Map to loop over the corresponding elements of each list and then do the calculation based on OP's code
Map(function(x, y) (y - apply(x, 1, mean))/apply(x,1,sd), l1, l2)
-output
[[1]]
[1] 0.0 1.5 4.5
[[2]]
[1] 1.3055824 -0.3396831 0.4057513
[[3]]
[1] 0.1237179 0.3396831 1.8516402
Also, if the datasets are really big, dapply from the collapse package may be more efficient:
library(collapse)
Map(function(x, y) (y - dapply(x, MARGIN = 1,
FUN = fmean))/dapply(x, MARGIN = 1, FUN = fsd), l1, l2)

Since your lists apparently are large, you probably could benefit from rowMeans2 and rowSds of the matrixStats package.
library(matrixStats)
Map(\(x, y) (y - rowMeans2(as.matrix(x))) / rowSds(as.matrix(x)), l1, l2)
# [[1]]
# [1] 0.0 1.5 4.5
#
# [[2]]
# [1] 1.3055824 -0.3396831 0.4057513
#
# [[3]]
# [1] 0.1237179 0.3396831 1.8516402
Data:
l1 <- list(structure(list(n1 = c(2, 2, 0), n2 = c(2, 1, 1), n3 = c(0,
1, 1), n4 = c(0, 1, 1)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(n1 = c(1, 6, 0), n2 = c(2, 1, 8), n3 = c(0,
2, 1), n4 = c(0, 7, 1)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(n1 = c(1, 6, 0), n2 = c(9, 1, 5), n3 = c(4,
2, 1), n4 = c(0, 7, 2)), class = "data.frame", row.names = c(NA,
-3L)))
l2 <- list(c(1, 2, 3), c(2, 3, 4), c(4, 5, 6))
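
For comparison, the same per-row calculation can also be written without apply() or extra packages. This is a minimal base R sketch rather than anything from the answers above: row_z is a hypothetical helper name, and the row standard deviation is computed from its definition (sum of squared deviations divided by n - 1), which is the same quantity sd() returns.

```r
# Hypothetical helper: (v - row mean) / row sd, using only base R
row_z <- function(df, v) {
  m  <- as.matrix(df)
  mu <- rowMeans(m)
  # m - mu recycles mu down the columns, so each row is centred on its own mean
  s  <- sqrt(rowSums((m - mu)^2) / (ncol(m) - 1))
  (v - mu) / s
}

# First two data frames and vectors from the question
l1 <- list(
  data.frame(n1 = c(2, 2, 0), n2 = c(2, 1, 1), n3 = c(0, 1, 1), n4 = c(0, 1, 1)),
  data.frame(n1 = c(1, 6, 0), n2 = c(2, 1, 8), n3 = c(0, 2, 1), n4 = c(0, 7, 1))
)
l2 <- list(c(1, 2, 3), c(2, 3, 4))

# Map pairs the i-th data frame with the i-th vector
res <- Map(row_z, l1, l2)
res[[1]]  # 0.0 1.5 4.5
```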

Building sequence data for a recommender system- replacing cross-tabular matrix with a variable value

I am trying to build a sequence data for a recommender system. I have built a cross-tabular data (Table 1) and Table 2 as shown below:
[Image: Table 1 (cross-tabular 0/1 matrix of students by course) and Table 2 (student number, course, grade)]
I have been trying to replace all the 1's in Table 1 by the "Grade" from the Table 2 in R.
Any insight/suggestion is greatly appreciated.

Instead of replacing the values in the first table with those from the second, the second table can be reshaped directly to 'wide' format with dcast:
library(reshape2)
res <- dcast(df2, St.No. ~ Courses, value.var = 'Grade')[names(df1)]
res
#  St.No. Math Phys Chem CS
#1      1         A       B
#2      2         B    B
#3      3    A         A  C
#4      4    B    B       D
If we need to replace the blanks with 0:
res[res == ""] <- "0"
data
df1 <- data.frame(St.No. = 1:4, Math = c(0, 0, 1, 1), Phys = c(1, 1, 0, 1),
Chem = c(0, 1, 1, 0), CS = c(1, 0, 1, 1))
df2 <- data.frame(St.No. = rep(1:4, each = 4), Courses = rep(c("Math",
"Phys", "Chem", "CS"), 4),
Grade = c("", "A", "", "B", "", "B", "B", "",
"A", "", "A", "C", "B", "B", "", "D"),
stringsAsFactors = FALSE)
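
If the goal is literally to overwrite the 1's in the first table, a lookup per course column is one possible alternative to reshaping. This is a minimal base R sketch using the df1 and df2 defined above; it assumes every student/course pair in df1 appears exactly once in df2, and it writes "0" wherever df1 has a 0.

```r
df1 <- data.frame(St.No. = 1:4, Math = c(0, 0, 1, 1), Phys = c(1, 1, 0, 1),
                  Chem = c(0, 1, 1, 0), CS = c(1, 0, 1, 1))
df2 <- data.frame(St.No. = rep(1:4, each = 4),
                  Courses = rep(c("Math", "Phys", "Chem", "CS"), 4),
                  Grade = c("", "A", "", "B", "", "B", "B", "",
                            "A", "", "A", "C", "B", "B", "", "D"),
                  stringsAsFactors = FALSE)

res <- df1
for (course in setdiff(names(df1), "St.No.")) {
  # Rows of the long table that belong to this course
  rows <- df2[df2$Courses == course, ]
  # Grade of each student in df1 for this course, matched by student number
  grade <- rows$Grade[match(df1$St.No., rows$St.No.)]
  # Keep the grade where df1 has a 1, otherwise write "0"
  res[[course]] <- ifelse(df1[[course]] == 1, grade, "0")
}
res$Phys  # "A" "B" "0" "B"
```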

Coding help in R - Subset and colSum is the topic [duplicate]

If I have a table like this:
user,v1,v2,v3
a,1,0,0
a,1,0,1
b,1,0,0
b,2,0,3
c,1,1,1
How do I turn it into this?
user,v1,v2,v3
a,2,0,1
b,3,0,3
c,1,1,1

In base R,
D <- matrix(c(1, 0, 0,
1, 0, 1,
1, 0, 0,
2, 0, 3,
1, 1, 1),
ncol=3, byrow=TRUE, dimnames=list(1:5, c("v1", "v2", "v3")))
D <- data.frame(user=c("a", "a", "b", "b", "c"), D)
aggregate(. ~ user, D, sum)
Returns
> aggregate(. ~ user, D, sum)
  user v1 v2 v3
1    a  2  0  1
2    b  3  0  3
3    c  1  1  1
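
Another base R option is rowsum(), which sums the rows of a numeric table by group. The following is a minimal sketch on the same data and is not part of the answer above; note that rowsum() returns the groups as row names, so the last step (optional) moves them back into a user column.

```r
D <- data.frame(user = c("a", "a", "b", "b", "c"),
                v1 = c(1, 1, 1, 2, 1),
                v2 = c(0, 0, 0, 0, 1),
                v3 = c(0, 1, 0, 3, 1),
                stringsAsFactors = FALSE)

# Sum the numeric columns within each user; groups become row names
sums <- rowsum(D[c("v1", "v2", "v3")], group = D$user)

# Optionally move the row names back into a regular column
sums <- data.frame(user = rownames(sums), sums,
                   row.names = NULL, stringsAsFactors = FALSE)
```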

You can use dplyr for this:
library(dplyr)
df = data.frame(
user = c("a", "a", "b", "b", "c"),
v1 = c(1, 1, 1, 2, 1),
v2 = c(0, 0, 0, 0, 1),
v3 = c(0, 1, 0, 3, 1))
group_by(df, user) %>%
summarize(v1_sum = sum(v1),
v2_sum = sum(v2),
v3_sum = sum(v3))
If you're not familiar with the %>% notation, it is basically like piping from bash. It takes the output from group_by() and puts it into summarize(). The same thing would be accomplished this way:
by_user = group_by(df, user)
df_summarized = summarize(by_user,
v1_sum = sum(v1),
v2_sum = sum(v2),
v3_sum = sum(v3))
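
The equivalence between nested calls and piped calls can be checked directly. The sketch below uses base R's native pipe |> instead of %>% so that it runs without dplyr; this assumes R 4.1 or later, and |> is a different operator from %>%, although for simple cases like this one both pass the left-hand result as the first argument of the next call.

```r
df <- data.frame(
  user = c("a", "a", "b", "b", "c"),
  v1 = c(1, 1, 1, 2, 1),
  v2 = c(0, 0, 0, 0, 1),
  v3 = c(0, 1, 0, 3, 1))

# Nested calls: read from the inside out
nested <- colSums(subset(df, user == "b", select = -user))

# Piped calls: read left to right; each result becomes the first argument
piped <- df |> subset(user == "b", select = -user) |> colSums()

identical(nested, piped)  # TRUE
```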
