conditional counting and grouping for the whole dataframe - r

I have this dataframe:
> df <- data.frame(Semester = sample(1:4, 20, replace=TRUE),
X1 = sample(c(1:7,NA), 20, replace =TRUE),
X2 = sample(c(1:7,NA), 20, replace =TRUE),
X3 = sample(c(1:7,NA), 20, replace =TRUE),
X4 = sample(c(1:7,NA), 20, replace =TRUE),
X5 = sample(c(1:7,NA), 20, replace =TRUE),
X6 = sample(c(1:7,NA), 20, replace =TRUE),
X7 = sample(c(1:7,NA), 20, replace =TRUE),
stringsAsFactors = FALSE)
> df
Semester X1 X2 X3 X4 X5 X6 X7
1 4 3 7 NA NA 1 2 7
2 3 NA 3 NA 4 3 2 6
3 1 2 5 3 4 7 NA 2
4 3 1 1 6 1 3 2 4
5 1 1 2 1 3 2 6 5
6 2 1 7 1 5 2 2 6
7 4 7 6 5 2 7 1 2
8 1 5 5 7 4 5 1 5
9 1 3 1 1 5 6 3 7
10 3 6 NA 1 1 5 NA 2
11 1 1 6 6 6 3 5 7
12 3 1 5 1 2 3 1 NA
13 4 1 4 1 1 5 6 1
14 1 5 4 4 NA 5 3 3
15 2 2 NA 4 1 1 5 4
16 3 6 7 6 7 3 3 7
17 1 1 2 4 5 4 5 3
18 4 4 7 7 6 NA 4 NA
19 3 4 2 3 4 4 3 5
20 2 1 NA 3 5 7 NA 6
I'm trying to get the output below, where n_k is the number of times the value k appears across all the X* variables within a Semester. For example, n_7 for Semester == 1 is the number of X* values equal to 7 in the rows where Semester is 1. (This output is only illustrative; the values are made up.)
Semester n_7 n_6 n_5 n_4 n_3 n_2 n_1
1 5 7 1 5 7 7 7
2 4 10 1 3 6 3 4
3 5 5 2 5 3 3 2
4 3 9 10 5 7 0 0
I tried by(), but it also counts the values in the Semester column. Is there another way to do this?
by(df, df$Semester,function(df){
count_if(eq(7), df)
count_if(eq(6), df)
count_if(eq(5), df)
count_if(eq(4), df)
count_if(eq(3), df)
count_if(eq(2), df)
count_if(eq(1), df)})

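You could use a melt() then dcast() approach: melt() stacks the X* columns into long format, and dcast() counts how often each value occurs per Semester. The trailing [-9] drops the ninth column, which holds the NA counts.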
library(data.table)
dcast(melt(df, "Semester"), Semester ~ value, fun=length)[-9]
# Semester 1 2 3 4 5 6 7
# 1 1 5 8 10 2 7 8 4
# 2 2 8 6 7 2 5 2 5
# 3 3 2 1 4 3 2 4 5
# 4 4 1 1 3 4 7 2 8
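For comparison, the same counting can be sketched in base R without reshaping packages. This is only an illustration under assumptions: the small df below is made up (it is not the question's random data), and the names values, semester and counts_df are hypothetical.

```r
# Small deterministic stand-in for the question's data (made-up values).
df <- data.frame(Semester = c(1, 1, 2, 2),
                 X1 = c(7, 1, NA, 7),
                 X2 = c(7, 2, 3, 7))

# Stack every X* column into one vector and repeat Semester once per column,
# so each value stays paired with the Semester of its row.
values <- unlist(df[-1], use.names = FALSE)
semester <- rep(df$Semester, times = ncol(df) - 1)

# A factor with levels 1:7 keeps a column for every possible value,
# even one that never occurs; table() drops NA by default.
counts <- table(Semester = semester, value = factor(values, levels = 1:7))

# One row per Semester, one column per value.
counts_df <- as.data.frame.matrix(counts)
counts_df
```

Because Semester is never part of the stacked values, it is not counted, which avoids the problem seen with by().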

Related

Create lagged variables for several columns group by two conditions in r

I would like to create lagged variables for several columns that are grouped by two conditions.
Here is the dataset:
df <- data.frame(id = c(rep(1,4),rep(2,4)), tp = rep(1:4,2), x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)
> df
id tp x1 x2 x3 x4
1 1 1 1 2 3 4
2 1 2 2 3 4 5
3 1 3 3 4 5 6
4 1 4 4 5 6 7
5 2 1 5 6 7 8
6 2 2 6 7 8 9
7 2 3 7 8 9 10
8 2 4 8 9 10 11
I want to lag x1, x2, x3, x4 that are grouped by id and tp and create new variables x1_lag1, x2_lag1, x3_lag1, x4_lag1, like this:
> df
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
How to achieve that?
Your result doesn't seem to be grouped by tp at all. It is grouped by id and ordered by tp within the id grouping.
Generally a "lag" is a variable that takes the value from the previous row. The columns you want labeled as "lag" columns take the value from the next row, so we use the lead function.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("x"), lead, .names = "{.col}_lag1")) %>%
ungroup()
# A tibble: 8 × 10
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
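If dplyr is not available, a similar result can be sketched in base R with ave(). This assumes the rows are already sorted by tp within each id, as in the question; lead1 and x_cols are hypothetical names introduced for the example.

```r
df <- data.frame(id = c(rep(1, 4), rep(2, 4)), tp = rep(1:4, 2),
                 x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)

# "Lead" by one: drop the first value and pad the end with NA.
lead1 <- function(v) c(v[-1], NA)

# Columns to shift: everything whose name starts with "x".
x_cols <- grep("^x", names(df), value = TRUE)

# ave() applies lead1 separately within each id group.
df[paste0(x_cols, "_lag1")] <- lapply(df[x_cols], function(col) {
  ave(col, df$id, FUN = lead1)
})
df
```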

filtering scores from one variable and placing them in a new variable

I have a variable, test scores, coded on a scale from 1 to 9.
I have to classify those who score 1-3 as low, 4-6 as good, and 7-9 as high, each in a new variable.
Then I have to make a new variable that compares low and high, and another that compares low and good.
test_scores <- sample(1:10, 122, replace = TRUE)
test_scores<-as.data.frame(test_scores)
low<- filter(test_scores,test_scores1 > 3)
high<- filter(test_scores, test_scores< 7)
good<-filter(test_scores,test_scores== 4:6)
But the Ns of the new variables do not add up to 122.
I thought of using ifelse():
low<- ifelse(test_scores$test_scores == 1:3 , 1:3 , 0)
mods<- ifelse(test_scores$test_scores == 4:6, 4:6, 0)
high<- ifelse(test_scores$test_scores == 7:9, 7:9, 0)
But some scores are not picked up; they become 0 even though the score matches. Any ideas?
You can use "cut" to generate the new bins:
set.seed(123)
test_scores <- sample(1:9, 122, replace = TRUE)
test_scores
#> [1] 3 3 2 6 5 4 6 9 5 3 9 9 9 3 8 7 9 3 4 1 7 5 7 9 9 7 5 7 5 6 9 2 5 8 2 1 9
#> [38] 9 6 5 9 4 6 8 6 6 7 1 6 2 1 2 4 5 6 3 9 4 6 9 9 7 3 8 9 3 7 3 7 6 5 5 8 3
#> [75] 2 2 6 4 1 6 3 8 3 8 1 7 7 7 6 7 5 6 8 5 7 4 3 9 7 6 9 7 2 3 8 4 7 4 1 8 4
#> [112] 9 8 6 4 8 3 4 4 6 1 4
cuts <- cut(test_scores, c(0,3,6,9), labels = FALSE)
cuts
#> [1] 1 1 1 2 2 2 2 3 2 1 3 3 3 1 3 3 3 1 2 1 3 2 3 3 3 3 2 3 2 2 3 1 2 3 1 1 3
#> [38] 3 2 2 3 2 2 3 2 2 3 1 2 1 1 1 2 2 2 1 3 2 2 3 3 3 1 3 3 1 3 1 3 2 2 2 3 1
#> [75] 1 1 2 2 1 2 1 3 1 3 1 3 3 3 2 3 2 2 3 2 3 2 1 3 3 2 3 3 1 1 3 2 3 2 1 3 2
#> [112] 3 3 2 2 3 1 2 2 2 1 2
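If named bins are more convenient than the codes 1, 2 and 3, cut() also accepts character labels. A minimal sketch, assuming the labels low, good and high from the question and a short made-up score vector (scores and bins are hypothetical names):

```r
# Made-up scores covering the edges of each bin.
scores <- c(1, 3, 4, 6, 7, 9)

# Intervals are (0,3], (3,6] and (6,9], so 3 is low and 4 is good.
bins <- cut(scores, breaks = c(0, 3, 6, 9), labels = c("low", "good", "high"))
bins

# Every score lands in exactly one bin, so the counts add up to length(scores).
table(bins)
```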
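If you want a variable for each bin, with zero otherwise, you must use %in%, not ==. With ==, R recycles the right-hand vector along the scores and compares position by position, so a score only matches when it happens to line up with the same value; %in% tests membership in the whole set instead: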
low<- ifelse(test_scores$test_scores %in% 1:3 , test_scores$test_scores , 0)
mods<- ifelse(test_scores$test_scores %in% 4:6, test_scores$test_scores, 0)
high<- ifelse(test_scores$test_scores %in% 7:9, test_scores$test_scores, 0)
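The difference between == and %in% is easiest to see on a short vector. The sketch below uses made-up scores; eq_result and in_result are hypothetical names.

```r
scores <- c(1, 2, 3, 3, 2, 1)

# == recycles 1:3 to 1, 2, 3, 1, 2, 3 and compares element by element,
# so the 4th and 6th scores "miss" even though they are in the range 1-3.
eq_result <- scores == 1:3
eq_result

# %in% asks, for each score, whether it appears anywhere in 1:3.
in_result <- scores %in% 1:3
in_result
```

This is why the == version silently turned matching scores into 0 in the question.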

Is there an easy way to create a data frame in R with the same vector repeating itself "n" times (as n columns)?

I think the title says it all
Let's jump to the example
Imagine I have a vector (the contents of which are not relevant for this example)
aux<-c(1:5)
I need to create a data frame that has the same vector repeating itself n times (n can vary, sometimes it is 8 times, sometimes it is 7)
I did it like this for repeating itself 8 times:
aux.df<-data.frame(aux,aux,aux,aux,aux,aux,aux,aux)
This got me the result I wanted but you can see why it's not an ideal way...
Is there a package, function, or other way to tell R to repeat the vector 'aux' 8 times?
I also tried creating a matrix and then transforming it into a data frame, but that didn't work and I got a weird data frame with vectors inside of each cell...
What I tried that didn't work:
aux.df<- as.data.frame(matrix(aux, nrows=5, ncol=8))
Using replicate().
as.data.frame(replicate(8, aux))
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
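Since n varies, it may help to keep it in a variable and name the columns explicitly. The sketch below relies on matrix() recycling its data to fill the requested shape; note that the argument is nrow, not nrows. The column names aux_1, aux_2, ... are an assumption made for the example.

```r
aux <- 1:5
n <- 8

# matrix() fills column by column and recycles aux, so each column is aux.
aux.df <- as.data.frame(matrix(aux, nrow = length(aux), ncol = n))

# Optional: give the columns descriptive names (hypothetical naming scheme).
names(aux.df) <- paste0("aux_", seq_len(n))
aux.df
```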
Another option uses rep() with matrix(). Parameters:
aux<-c(1:5)
n<-8
Vector aux repeated as columns:
aux.df<-as.data.frame(matrix(rep(aux,n),ncol=n,byrow = FALSE))
Vector aux repeated as rows:
aux.df<-as.data.frame(matrix(rep(aux,n),nrow=n,byrow = TRUE))
Here are some possible options:
> data.frame(aux = aux)[rep(1, 8)]
aux aux.1 aux.2 aux.3 aux.4 aux.5 aux.6 aux.7
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
> data.frame(kronecker(t(rep(1, 8)), aux))
X1 X2 X3 X4 X5 X6 X7 X8
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
> data.frame(outer(aux, rep(1, 8)))
X1 X2 X3 X4 X5 X6 X7 X8
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
> list2DF(rep(list(aux), 8))
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5

How to lag multiple specific columns of a data frame in R

I would like to lag multiple specific columns of a data frame in R.
Let's take this generic example. Let's assume I have defined which columns of my dataframe I need to lag:
Lag <- c(0, 1, 0, 1)
Lag.Index <- is.element(Lag, 1)
df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)
My initial dataframe:
x1 x2 x3 x4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
I would like to compute the following dataframe:
x1 x2 x3 x4
1 1 NA 1 NA
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
I would know how to do it for only one lagged column as shown here, but not able to find a way to do it for multiple lagged columns in an elegant way. Any help is very much appreciated.
You can use purrr's map2_dfc to lag different values by column.
purrr::map2_dfc(df, Lag, dplyr::lag)
# x1 x2 x3 x4
# <int> <int> <int> <int>
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or with data.table:
library(data.table)
setDT(df)[, names(df) := Map(shift, .SD, Lag)]
A data.table option using shift along with Vectorize
> setDT(df)[, Vectorize(shift)(.SD, Lag)]
x1 x2 x3 x4
[1,] 1 NA 1 NA
[2,] 2 1 2 1
[3,] 3 2 3 2
[4,] 4 3 4 3
[5,] 5 4 5 4
[6,] 6 5 6 5
[7,] 7 6 7 6
[8,] 8 7 8 7
Not sure whether this is elegant enough, but I would use dplyr's mutate_at function to tweak the chosen columns:
library(dplyr)
df %>% mutate_at(.vars = vars(x2, x4), .funs = ~lag(., default = NA))
We convert Lag to a logical vector, use it to pick the corresponding column names, and apply lag() to those columns with across() from dplyr:
library(dplyr)
df %>%
mutate(across(names(.)[as.logical(Lag)], lag))
# x1 x2 x3 x4
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or we can do this in base R
df[as.logical(Lag)] <- rbind(NA, df[-nrow(df), as.logical(Lag)])
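The base R line above shifts by exactly one row. If Lag could hold other amounts, a per-column shift can be sketched with Map(). This is an illustration only: shift_by is a hypothetical helper, and the Lag values (including the 2) are made up for the example.

```r
Lag <- c(0, 1, 0, 2)
df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)

# Shift v down by k rows, padding the top with NA.
# length(v) - k is used instead of head(v, -k) so that k = 0 leaves v unchanged.
shift_by <- function(v, k) c(rep(NA, k), head(v, length(v) - k))

# Apply the matching lag to each column; df[] keeps the data frame structure.
df[] <- Map(shift_by, df, Lag)
df
```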

From table to data.frame

I have a table that looks like:
dat = data.frame(expand.grid(x = 1:10, y = 1:10),
z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))
tabl
y
z 1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Now how do I transform it into a data.frame that looks like
1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Here are a couple of options.
The reason as.data.frame(tabl) doesn't work is that it dispatches to the S3 method as.data.frame.table() which does something useful but different from what you want.
as.data.frame.matrix(tabl)
# 1 2 3 4 5 6 7 8 9 10
# A 5 4 3 1 1 3 3 2 6 2
# B 1 4 3 4 5 3 4 4 3 3
# C 4 2 4 5 4 4 3 4 1 5
## This will also work
as.data.frame(unclass(tabl))
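To make the difference concrete, here is a small sketch on made-up data (the vectors z and y below are not the question's random sample; long and wide are hypothetical names):

```r
z <- c("A", "A", "B", "C", "C", "C")
y <- c(1, 2, 1, 1, 2, 2)
tabl <- table(z, y)

# as.data.frame() on a table gives long format: one row per (z, y) cell,
# with the count in a column called Freq.
long <- as.data.frame(tabl)
long

# as.data.frame.matrix() keeps the cross-tab layout: z as row names, y as columns.
wide <- as.data.frame.matrix(tabl)
wide
```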
