Dynamic sum/count condition while assignment - r

I have two data frames (table1 and randomdata) with the following schema:
#randomdata
randomdata$cube = {1,5,3,3,4,5,5,2,2,6,1,2,....} (1000 rows)
#table1
table1$side = {1,2,3,4,5,6} (6 rows)
table1$frequency = NULL
I want to count how often each side of the cube occurs in the first 10 rows of randomdata$cube and assign the result to table1$frequency in the corresponding row (matched on table1$side).
I can do this successfully this way:
table1$frequency[1] <- sum(randomdata$cube[1:10] == 1)
table1$frequency[2] <- sum(randomdata$cube[1:10] == 2)
table1$frequency[3] <- sum(randomdata$cube[1:10] == 3)
...
table1$frequency[6] <- sum(randomdata$cube[1:10] == 6)
This works very well, but there must be a better way.
Instead of 6 statements, I imagine something like this:
table1$frequency <- sum(randomdata$cube[1:10] == table1$side)
Can someone show me a more dynamic way to do this?
Thank you.

We can do this by converting the 'cube' column to a factor with levels specified as 1:6 and then calling table() on it. Without the explicit levels, sides that do not occur in the first 10 rows would be dropped from the table output; with them, a missing side gets a count of 0.
table1$frequency <- table(factor(randomdata$cube[1:10], levels = 1:6))
Or using tidyverse
library(tidyverse)
randomdata %>%
slice(1:10) %>%
count(cube = factor(cube, levels = 1:6), .drop = FALSE) %>%
pull(n) %>%
mutate(table1, frequency = .)
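The one-liner imagined in the question does not work because == compares the two vectors element by element (recycling the shorter one, with a warning when the lengths are 10 and 6), and sum() then collapses everything into a single number. A minimal base R sketch of a per-side count is shown below; the seed and the simulated randomdata are assumptions made only so the example runs on its own.

```r
# Simulated stand-in for the asker's data (assumption: 1000 rolls of a fair die)
set.seed(42)
randomdata <- data.frame(cube = sample(1:6, 1000, replace = TRUE))
table1 <- data.frame(side = 1:6)

first10 <- randomdata$cube[1:10]

# One count per side: compare the first 10 rolls against each side in turn
table1$frequency <- sapply(table1$side, function(s) sum(first10 == s))

# tabulate() gives the same counts when the sides are exactly 1..6
freq_tab <- tabulate(first10, nbins = 6)

table1
```

sapply() looks up each value of table1$side, so it also works when the sides are not consecutive integers; tabulate() is shorter but relies on the sides being 1 to 6 in order.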

Related

Subset of a large data set including 2000 variables: removing variables when just one of their levels has non-zero frequency

I have a very large data set with 2000 variables, the majority of which are factors. Some variables have more than one level, but only one of their levels has a non-zero frequency.
Suppose:
Agegroup : Group 1 (Freq=200) Group2(Freq=0) Group3(Freq=0).
I am looking for a loop or function that checks the frequencies of all variables and removes the variables for which just one level has a non-zero frequency.
The following function works just for one variable, but how about checking all variables in the data set?
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
To get rid of the levels that are empty, you can use droplevels.
I added comments in the code. If you have any questions, let me know.
df <- data.frame(x = factor(rep("x", 10), levels = c("x", "y")),
z = factor(c(rep("m", 5), rep("p", 5))))
summary(df)
remr <- vector(mode = "list")
invisible(lapply(1:ncol(df),
function(i) {
df <- df %>% droplevels()
if(length(unique(df[[i]])) < 2) { # less than 2 unique values
remr <<- append(remr, names(df)[i]) # make a list
} # remove at end, so col indices don't change
}
))
df1 <- df %>% select(-unlist(remr))
summary(df1) # inspect
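The same check can also be written without the global assignment. The following is a minimal base R sketch, not the answerer's method: it reuses the example data frame from above and keeps a column only if it has at least two distinct observed values. Ignoring NA values here is an assumption about what the asker wants.

```r
# Example data from the answer above: x has an empty level "y", z has two used levels
df <- data.frame(
  x = factor(rep("x", 10), levels = c("x", "y")),
  z = factor(c(rep("m", 5), rep("p", 5)))
)

# TRUE for columns with more than one distinct non-missing value
has_variation <- vapply(
  df,
  function(col) length(unique(col[!is.na(col)])) > 1,
  logical(1)
)

# Keep only those columns; drop = FALSE keeps a data frame even if one column remains
df1 <- df[, has_variation, drop = FALSE]
names(df1)
```

Because vapply() runs over every column, this scales to 2000 variables without an explicit loop, and it works for non-factor columns as well.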

Lookup tables in R

I have a tibble with a ton of data in it, but most importantly, I have a column that references a row in a lookup table by number (ex. 1,2,3 etc).
df <- tibble(ref = c(1,1,1,2,5),
data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
Intercept = c(40,38,40,38,36),
Min = c(25,25,21,21,18),
Max = c(36,36,38,37,32))
I need to do a calculation for each row in the original tibble based on the information in the referenced row in the lookup table.
df$result <- df$data - lkup$Intercept[lkup$CurveID == df$ref]/lkup$Slope[lkup$CurveID == df$ref]
The idea is to access the slope or intercept (etc) value from the correct row of the lookup table based on the number in the data table, and to do this for each data point in the column. But I keep getting an error telling me my data isn't compatible, and that my objects need to be of the same length.
You could also do it with match()
df$result <- df$data - lkup$Intercept[match(df$ref, lkup$CurveID)]/lkup$Slope[match(df$ref, lkup$CurveID)]
df$result
# [1] 43.52632 44.52632 45.52632 45.85714 42.90909
You could use the dplyr package to join the tibbles together. If the ref column and CurveID column have the same name then left_join will combine the two tibbles by the matching rows.
library(dplyr)
df <- tibble(CurveID = c(1,1,1,2,5),
data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
Intercept = c(40,38,40,38,36),
Min = c(25,25,21,21,18),
Max = c(36,36,38,37,32))
df <- df %>% left_join(lkup, by = "CurveID")
Then do the calculation on each row
df <- df %>% mutate(result = data - (Intercept/Slope)) %>%
select(CurveID, data, result)
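If renaming ref to CurveID is not convenient, left_join() can also match columns with different names through a named by vector. A minimal sketch using the data from the question follows; only the columns needed for the calculation are included in the lookup table here.

```r
library(dplyr)

df <- tibble(ref = c(1, 1, 1, 2, 5),
             data = c(33, 34, 35, 35, 32))
lkup <- tibble(CurveID = c(1, 2, 3, 4, 5),
               Slope = c(-3.8, -3.5, -3.1, -3.3, -3.3),
               Intercept = c(40, 38, 40, 38, 36))

res <- df %>%
  # match df$ref against lkup$CurveID; the key keeps the name "ref"
  left_join(lkup, by = c("ref" = "CurveID")) %>%
  mutate(result = data - Intercept / Slope) %>%
  select(ref, data, result)

res
```

Rows of df whose ref has no match in lkup would get NA for Slope and Intercept, and therefore NA for result.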
For completeness' sake, here's one way to literally do what OP was trying:
library(slider)
df %>%
mutate(result = data - slide_dbl(ref, ~ slice(lkup, .x)$Intercept /
slice(lkup, .x)$Slope))
though since slice goes by row number, this relies on CurveID equalling the row number (we make no reference to CurveID at all). You can write it differently with filter but it ends up being more code.

Duplicated rows but also return the unique row which has duplicates

I'm currently working on data sets that have a lot of duplicated rows.
I need to look at those rows and also at the first occurrences, which the duplicated() function in R does not flag.
I'm now using 3 steps to achieve it:
dt1 <- dt %>%
filter(., duplicated(ID) == TRUE) %>%
mutate(., dup = 1)
dt <- left_join(dt, dt1, by = "ID")
then
dt1 <- filter(dt, dup == 1)
I need to repeat these steps very often, so I wonder whether there is a simpler way that gives the same output.
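One possible shortcut, sketched here under the assumption that the goal is every row whose ID occurs more than once (first occurrence included), is to filter on the group size or to combine duplicated() with its fromLast argument. The small dt below is made-up illustration data.

```r
library(dplyr)

# Hypothetical example data: IDs 2 and 3 are repeated, IDs 1 and 4 are not
dt <- data.frame(ID = c(1, 2, 2, 3, 3, 3, 4),
                 value = letters[1:7])

# dplyr: keep every row that belongs to an ID with more than one row
dups <- dt %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  ungroup()

# base R: flag duplicates scanning forwards and backwards, then combine
dups_base <- dt[duplicated(dt$ID) | duplicated(dt$ID, fromLast = TRUE), ]
```

Both versions return the five rows with ID 2 or 3 in a single step, so no helper column or join is needed.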

Loop through multiple columns in R

I have the following code I'd like to run for multiple columns in a data frame called ccc.
ccc %>%
group_by(LA) %>%
summarise(Def = sum(DefaultOct05 == 'Def'),
NDef = sum(DefaultOct05 != 'Def'),
DRate = mean(DefaultOct05 == 'Def'))
LA is the name of one of the columns. How would I set up a loop to run through a number of different columns?
I've tried the following.
for (i in 26:ncol(ccc)) {
ccc %>%
group_by(i) %>%
summarise(Def = sum(DefaultOct05 == 'Def'),
NDef = sum(DefaultOct05 != 'Def'),
DRate = mean(DefaultOct05 == 'Def'))
}
But I get the following error message.
Error in resolve_vars(new_groups, tbl_vars(.data)) :
unknown variable to group by : i
What most people will miss in your question is a reproducible data set. Without it, it's often very hard to reproduce your problem and solve it.
If I got you right, your data set looks like the one below:
set.seed(1)
ccc=data.frame(Default=sample(c(0,1),100,replace = TRUE),LA=sample(c("X","Y","Z"),100,replace = TRUE),DC=sample(c("A","B","C"),100,replace = TRUE))
do.call(rbind, ...) row-binds the data frames in the list returned by lapply().
lapply(ccc, function(Var) ...) applies the function to every element of ccc, in our case to every column.
library(dplyr)
do.call(rbind,lapply(ccc, function(Var) {
dat=data.frame(Var,Default=ccc$Default) %>% group_by(Var) %>% summarise(Def=sum(Default),NDef=n()-sum(Default),DRate=mean(Default))
return(as.data.frame(dat))
}
))
"LA is the name of one of the columns"
Actually, group_by() in dplyr groups the rows by the values stored in a column, so it expects a column, not a loop index. I guess you want to do something else.
If you want to apply the same function to different columns you could use summarize_at.
df <- data.frame( id = c(1:20),
a1 = runif(20),
b1 = runif(20),
c1 = runif(20)
)
library(dplyr)
df %>% summarise_at(c("a1","b1","c1"), list(med = median,
avr = mean))
# result:
# a1_med b1_med c1_med a1_avr b1_avr c1_avr
# 1 0.6444056 0.5266252 0.6420554 0.5605837 0.4983654 0.5546381
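The loop in the question fails because group_by(i) looks for a column literally named i. A minimal sketch of one way to group by a column whose name is held in a variable is shown below, using across(all_of(col)) from dplyr 1.0 or later. The simulated ccc and the vector group_cols are assumptions for illustration; in the question the grouping columns would be names(ccc)[26:ncol(ccc)].

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the asker's data
ccc <- data.frame(
  DefaultOct05 = sample(c("Def", "NDef"), 100, replace = TRUE),
  LA = sample(c("X", "Y", "Z"), 100, replace = TRUE),
  DC = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# Names of the columns to group by, one summary per column
group_cols <- c("LA", "DC")

results <- lapply(group_cols, function(col) {
  ccc %>%
    group_by(across(all_of(col))) %>%   # group by the column named in col
    summarise(Def = sum(DefaultOct05 == "Def"),
              NDef = sum(DefaultOct05 != "Def"),
              DRate = mean(DefaultOct05 == "Def"))
})
names(results) <- group_cols

results$LA
```

Returning the summaries in a named list also addresses a second problem with the original for loop: inside a loop the result of a pipeline is neither printed nor stored unless it is assigned.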

Issues with replacing a subset of a data.frame using the R Package dplyr

I am trying to replace some filtered values of a data set. So far, I have written these lines of code:
df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA)),
where uniq is just a list containing variable names I want to focus on (and group1 and values are column names). This is actually working. However, it only outputs the altered filtered rows and does not replace anything in the data set df. Does anyone have an idea, where my mistake is? Thank you so much! The following code is to reproduce the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)){
df <- df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like to get is that it leaves all values except the last one since it is one unique group (group1 == C) and 0.9 < 1. So I'd like to get the exact same data frame here except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr does not modify df in place. A pipeline returns a new data frame, and that result is only kept if you store it with an assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data is stored in a new object mtcars4
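To change values inside df without losing the other rows, one option is to skip filter() and compute the condition per group with group_by(). The sketch below is one reading of the question (set values to NA in every group whose sum is below 1). The values are altered slightly from the question so that no group sum sits exactly on the threshold of 1, where floating-point rounding could decide the outcome.

```r
library(dplyr)

# Hypothetical values: group sums are A = 1.3, B = 1.1, C = 0.9
df <- data.frame(
  group1 = c("A", "A", "A", "B", "B", "C"),
  values = c(0.6, 0.3, 0.4, 0.2, 0.9, 0.9)
)

df <- df %>%
  group_by(group1) %>%
  # the condition is a single TRUE/FALSE per group, so a plain if works here
  mutate(values = if (sum(values) < 1) NA_real_ else values) %>%
  ungroup()

df
```

This also touches on the if versus ifelse part of the question: inside a grouped mutate() the test sum(values) < 1 has length one, so if ... else is sufficient, whereas ifelse() is meant for element-wise conditions.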
