I would like to add a new column holding the lagged numbers, but when the group changes I do not want to pick up the previous group's value as the lag. This is shown below using ifelse, but how would you do this with apply()?
mydata=data.frame(groups = c("A","A","B","B"), numbers= c(1,2,3,4))
mydata$numbers_lagged = lag(mydata$numbers, k=1)
mydata$groups_lagged= lag(mydata$groups, k=1)
mydata$numbers_lagged= ifelse(mydata$groups != mydata$groups_lagged,NA, mydata$numbers_lagged) #if the group does not equal the previous group then set to NA
mydata
Rather than apply, I recommend using group_by in dplyr:
library(dplyr)
mydata%>%group_by(groups)%>%dplyr::mutate(numbers_lagged=lag(numbers))%>%
ungroup()%>%
arrange(groups)%>%
mutate(groups_lagged=lag(groups))
groups numbers numbers_lagged groups_lagged
<fctr> <dbl> <dbl> <fctr>
1 A 1 NA NA
2 A 2 1 A
3 B 3 NA A
4 B 4 3 B
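For a base R route without dplyr, a minimal sketch using ave() follows. Note that stats::lag() shifts the time base of a series rather than the values of a plain vector, so the sketch uses a small helper, shift1 (a name introduced here for illustration only), to do the shift within each group.

```r
mydata <- data.frame(groups = c("A", "A", "B", "B"), numbers = c(1, 2, 3, 4))

# shift1 is a hypothetical helper: drop the last value and pad the front with NA
shift1 <- function(x) c(NA, head(x, -1))

# ave() applies shift1 separately within each level of groups,
# so the first row of every group gets NA instead of the previous group's value
mydata$numbers_lagged <- ave(mydata$numbers, mydata$groups, FUN = shift1)
mydata
```

Because the shift happens inside each group, no second pass with ifelse is needed to blank out values that crossed a group boundary.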
Related
In R, I'm trying to average a subset of a column based on selecting a certain value (ID) in another column. Consider the example of choosing an ID among 100 IDs, perhaps the ID number being 5. Then, I want to average a subset of values in another column that corresponds to the ID number that is 5. Then, I want to do the same thing for the rest of the IDs. What should this function be?
Using dplyr:
library(dplyr)
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
dt %>%
group_by(ID) %>%
summarise(avg = mean(values))
Output:
ID avg
<int> <dbl>
1 1 41.9
2 2 79.8
3 3 39.3
Data:
ID values
1 1 8.628964
2 1 99.767843
3 1 17.438596
4 2 79.700918
5 2 87.647472
6 2 72.135906
7 3 53.845573
8 3 50.205122
9 3 13.811414
We can use a group by mean. In base R, this can be done with aggregate
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
aggregate(values ~ ID, dt, mean)
Output:
ID values
1 1 40.07086
2 2 53.59345
3 3 47.80675
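The averages printed in the two answers differ because runif() is called without a seed. As a small check that can be verified by hand, here is a sketch with fixed values (the numbers are made up for illustration) using tapply(), plus ave() for the case where the group mean should be attached to every row.

```r
# Fixed, made-up values so the group means are easy to check by hand
dt <- data.frame(ID = rep(1:3, each = 3),
                 values = c(2, 4, 6, 10, 20, 30, 1, 1, 1))

# One mean per ID, returned as a named vector
avg_by_id <- tapply(dt$values, dt$ID, mean)

# Same means, repeated on every row of the original data (ave() defaults to mean)
dt$avg <- ave(dt$values, dt$ID)

avg_by_id
dt
```

tapply() is convenient when only the summary is needed; ave() keeps the original row count, which helps when the mean feeds a later row-level calculation.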
I am trying to get a total number of friends that will become the denominator in a later step.
example data:
set.seed(24) ## for sake of reproducibility
n <- 5
data <- data.frame(id=1:n,
Q1= c("same", "diff", NA, NA, NA),
Q2= c("diff", "diff", "same", "diff", NA),
Q3= c("same", "diff", NA ,NA, "diff"),
Q4= c("diff", "same", NA, NA, NA))
I first need to create a column that contains a numeric count of how many columns each participant responded to (either "same" or "diff", not counting NAs/blanks). I have tried the following:
friendship <- total.friends <- rowSums(c(data$Q1, data$Q2, data$Q3, data$Q4)), != "")
friendship <- total.friends <-rowSums(!is.na(c(data$Q1, data$Q2, data$Q3, data$Q4)))
Neither is effective, likely because my data is not numeric. The first did count the cells but did not group by id as I require. Is there any function I can use to count the populated cells? How can I edit this to count cells populated only with "diff" so that I can then start on the second step (making the proportion)?
You could do the following:
data2 <- apply(data[, -1], MARGIN = 1, function(x) length(x[!is.na(x)]))
result <- setNames(data.frame(data[, 1], data2), c("id", "number"))
result will then hold the number of non-NA answers for each id.
data2 is a count of the non-NA entries for each id. apply with MARGIN = 1 takes each row of the data frame and applies a function to it. The function here is length(x[!is.na(x)]): x[!is.na(x)] drops the NA entries of the row, and length() then counts how many entries remain.
The result of apply is a vector with one element per row; since there is one row per id, this amounts to computing the count for each id.
Lastly, the result line attaches the id column to those counts, so each count is identified rather than sitting in an unlabeled column.
Hope this works for you :)
Here's a regex solution with grep:
data$count <- apply(data, 1, function(x) length(grep("[a-z]", x, value = TRUE)))
Here length counts the cells in each row in which grep finds at least one lower-case letter; NA cells and the numeric id never match.
Result:
data
id Q1 Q2 Q3 Q4 count
1 1 same diff same diff 4
2 2 diff diff diff same 4
3 3 <NA> same <NA> <NA> 1
4 4 <NA> diff <NA> <NA> 1
5 5 <NA> <NA> diff <NA> 1
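The rowSums() attempts in the question fail because c() flattens the four columns into one long vector, and rowSums() needs a two-dimensional object. A minimal base R sketch that keeps the columns together as a data frame (the column names Total, Diff and Prop are chosen here for illustration):

```r
data <- data.frame(id = 1:5,
                   Q1 = c("same", "diff", NA, NA, NA),
                   Q2 = c("diff", "diff", "same", "diff", NA),
                   Q3 = c("same", "diff", NA, NA, "diff"),
                   Q4 = c("diff", "same", NA, NA, NA))

# Select the answer columns as a data frame rather than concatenating them
answers <- data[c("Q1", "Q2", "Q3", "Q4")]

# Number of answered (non-NA) questions per row
data$Total <- rowSums(!is.na(answers))

# Number of "diff" answers per row; na.rm = TRUE skips the unanswered cells
data$Diff <- rowSums(answers == "diff", na.rm = TRUE)

# Proportion of "diff" among answered questions
data$Prop <- data$Diff / data$Total
data
```

A row with no answers at all would give 0 / 0, that is NaN, for Prop; this example data has none, so how to treat that case is left open.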
You can also accomplish this using c_across and rowwise from the dplyr library:
library(dplyr)
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Total = sum(!is.na(c_across(Q1:Q4)))) %>%
dplyr::ungroup()
Note: alternatively you can use starts_with("Q") inside of c_across to do this across all columns that start with "Q" (shown below).
To count a specific response, or to compute other variables that depend on a newly created one (such as a proportion), add them to the same mutate call:
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Total = sum(!is.na(c_across(starts_with("Q")))),
Diff = sum(c_across(starts_with("Q")) == "diff", na.rm = T),
Prop = Diff / Total) %>%
dplyr::ungroup()
id Q1 Q2 Q3 Q4 Total Diff Prop
<int> <chr> <chr> <chr> <chr> <int> <int> <dbl>
1 1 same diff same diff 4 2 0.5
2 2 diff diff diff same 4 3 0.75
3 3 NA same NA NA 1 0 0
4 4 NA diff NA NA 1 1 1
5 5 NA NA diff NA 1 1 1
I have a data frame with three columns. Each row contains three unique numbers between 1 and 5 (inclusive).
df <- data.frame(a=c(1,4,2),
b=c(5,3,1),
c=c(3,1,5))
I want to use mutate to create two additional columns that, for each row, contain the two numbers between 1 and 5 that do not appear in the initial three columns in ascending order. The desired data frame in the example would be:
df2 <- data.frame(a=c(1,4,2),
b=c(5,3,1),
c=c(3,1,5),
d=c(2,2,3),
e=c(4,5,4))
I tried to use the below mutate function utilizing setdiff to accomplish this, but returned NAs rather than the values I was looking for:
df <- df %>% mutate(d=setdiff(c(a,b,c),c(1:5))[1],
e=setdiff(c(a,b,c),c(1:5))[2])
I can get around this by looping through each row (or using an apply function) but would prefer a mutate approach if possible.
Thank you for your help!
Base R:
cbind(df, t(apply(df, 1, setdiff, x = 1:5)))
# a b c 1 2
# 1 1 5 3 2 4
# 2 4 3 1 2 5
# 3 2 1 5 3 4
Warning: if there are any non-numerical columns, apply will happily up-convert things (converting to a matrix internally).
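To make that warning concrete, here is a small sketch with a made-up data frame, df_mixed, that has one character column. It shows every row reaching the function as character, and one possible workaround: selecting the numeric columns before calling apply.

```r
# Hypothetical data frame with a non-numeric column (for illustration only)
df_mixed <- data.frame(a = c(1, 4), b = c(5, 3), label = c("x", "y"))

# apply() converts the data frame to a matrix first; with a character column
# present, the whole matrix, and therefore each row, becomes character
row_classes <- apply(df_mixed, 1, class)
row_classes

# Possible workaround: keep only the numeric columns before calling apply()
numeric_cols <- df_mixed[sapply(df_mixed, is.numeric)]
missing_vals <- t(apply(numeric_cols, 1, setdiff, x = 1:5))
missing_vals
```

With two numeric columns per row, three of the values 1 to 5 are missing from each row, so missing_vals has three columns here.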
We can use pmap to loop over the rows, create a list column and then unnest it to create two new columns
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(out = pmap(., ~ setdiff(1:5, c(...)) %>%
as.list%>%
set_names(c('d', 'e')))) %>%
unnest_wider(c(out))
# A tibble: 3 x 5
# a b c d e
# <dbl> <dbl> <dbl> <int> <int>
#1 1 5 3 2 4
#2 4 3 1 2 5
#3 2 1 5 3 4
Or using base R
df[c('d', 'e')] <- do.call(rbind, lapply(asplit(df, 1), function(x) setdiff(1:5, x)))
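As a side note on why the attempt in the question returned NA: setdiff(x, y) keeps the elements of x that are not in y, so the arguments were in the wrong order, and inside mutate the names a, b and c refer to whole columns rather than a single row. A short base R sketch of both points (the variable names are chosen here for illustration):

```r
df <- data.frame(a = c(1, 4, 2), b = c(5, 3, 1), c = c(3, 1, 5))

# What the question's call effectively computed: all column values pooled,
# minus 1:5. Every pooled value is in 1:5, so nothing is left.
all_values <- c(df$a, df$b, df$c)
pooled_diff <- setdiff(all_values, 1:5)
pooled_diff        # empty vector
pooled_diff[1]     # indexing an empty vector gives NA

# Correct argument order, applied to the values of a single row (row 1)
row1_missing <- setdiff(1:5, c(df$a[1], df$b[1], df$c[1]))
row1_missing
```

This is why the working answers all swap the argument order and evaluate setdiff once per row.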
Here is a very similar question:
Aggregate multiple rows of the same data.frame in R based on common values in given columns
In my situation, the selection of columns is changing in different simulated samples. I have the selected column indices in each simulation. How can I use the function aggregate on indices instead of variable names? Namely, in the answer of that question, how can I use a code like this:
c=c(1,2,3)
aggregate(value ~ df[,c], FUN = mean, data=df) # comparing to aggregate(value ~ item + size + weight, FUN = mean, data=df)
(Please note that the above line won't run in R.)
Thank you for any help!
Without the formula method, pass the 'value' column as the data to aggregate, pass the grouping columns (selected by index) as the by argument, and specify the function:
aggregate(df["value"], df[,c], FUN = mean)
#  item size weight value
#1 B 1 2 3
#2 C 3 2 1
#3 A 2 3 5
With the formula method, subset the data to the grouping columns plus the column we want the mean of, and use . to stand for all the remaining columns in that subset:
aggregate(value ~ ., data= df[, c('value', names(df)[c])], mean)
# item size weight value
#1 B 1 2 3
#2 C 3 2 1
#3 A 2 3 5
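Since the linked data is not reproduced here, the following sketch uses a small made-up data frame to show the index-based call end to end. It also shows reformulate() from the stats package as one more way to build the formula from indices. The index vector is named idx rather than c to avoid confusion with the c() function.

```r
# Made-up data for illustration; not the data from the linked question
df <- data.frame(item   = c("A", "A", "B", "B"),
                 size   = c(2, 2, 1, 1),
                 weight = c(3, 3, 2, 2),
                 value  = c(4, 6, 1, 5))

idx <- c(1, 2, 3)  # indices of the grouping columns

# Non-formula interface: df[idx] stays a data frame (a list), as `by` requires
res_by <- aggregate(df["value"], df[idx], FUN = mean)

# Formula interface: build value ~ item + size + weight from the column names
f <- reformulate(names(df)[idx], response = "value")
res_formula <- aggregate(f, data = df, FUN = mean)

res_by
res_formula
```

Using df[idx] instead of df[, idx] matters when idx has length one: the single-bracket list form still returns a data frame, whereas the matrix-style form would drop to a plain vector.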
--
If we want to use dplyr, use group_by_at and specify the c variables in it
library(dplyr)
df %>%
group_by_at(c) %>%
# or extract column names, convert to symbol, and evaluate (!!!)
#group_by(!!! rlang::syms(names(.)[c])) %>%
summarise(value = mean(value))
# A tibble: 3 x 4
# Groups: item, size [?]
# item size weight value
# <fct> <int> <int> <dbl>
#1 A 2 3 5
#2 B 1 2 3
#3 C 3 2 1
NOTE: The input dataset is taken from the link in the OP's post
I am trying to calculate a grouped rolling sum based on a window size k but, in the event that the within group row index (n) is less than k, I want to calculate the rolling sum using the condition k=min(n,k).
My issue is similar to this question R dplyr rolling sum but I am looking for a solution that provides a non-NA value for each row.
I can get part of the way there using dplyr and rollsum:
library(zoo)
library(dplyr)
df <- data.frame(Date=rep(seq(as.Date("2000-01-01"),
as.Date("2000-12-01"),by="month"),2),
ID=c(rep(1,12),rep(2,12)),value=1)
df <- tbl_df(df)
df <- df %>%
group_by(ID) %>%
mutate(total3mo=rollsum(x=value,k=3,align="right",fill="NA"))
df
Source: local data frame [24 x 4]
Groups: ID [2]
Date ID value total3mo
(date) (dbl) (dbl) (dbl)
1 2000-01-01 1 1 NA
2 2000-02-01 1 1 NA
3 2000-03-01 1 1 3
4 2000-04-01 1 1 3
5 2000-05-01 1 1 3
6 2000-06-01 1 1 3
7 2000-07-01 1 1 3
8 2000-08-01 1 1 3
9 2000-09-01 1 1 3
10 2000-10-01 1 1 3
.. ... ... ... ...
In this case, what I would like is to return the value 1 for observations on 2000-01-01 and the value 2 for observations on 2000-02-01. More generally, I would like the rolling sum to be calculated over the largest window possible but no larger than k.
In this particular case it's not too difficult to change some NA values by hand. However, ultimately I would like to add several more columns to my data frame that will be rolling sums calculated over various windows. In this more general case it will get quite tedious to go back and change many NA values by hand.
Using the partial=TRUE argument of rollapplyr :
df %>%
group_by(ID) %>%
mutate(roll = rollapplyr(value, 3, sum, partial = TRUE)) %>%
ungroup()
or without dplyr (still need zoo):
roll <- function(x) rollapplyr(x, 3, sum, partial = TRUE)
transform(df, roll = ave(value, ID, FUN = roll))
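If zoo is not available, the same partial-window sum can be sketched in base R with cumsum(). The helper name roll_sum_partial is introduced here for illustration, and the sketch assumes value contains no NA.

```r
# Hypothetical helper: right-aligned rolling sum over at most k values.
# The sum of the last k values equals cumsum at i minus cumsum at i - k;
# for the first k positions there is nothing to subtract, so 0 is used.
roll_sum_partial <- function(x, k) {
  cs <- cumsum(x)
  lagged <- c(rep(0, k), cs)[seq_along(x)]
  cs - lagged
}

df <- data.frame(ID = rep(1:2, each = 12), value = 1)

# ave() restarts the rolling sum within each ID
df$roll <- ave(df$value, df$ID, FUN = function(x) roll_sum_partial(x, 3))
head(df, 4)
```

Adding further windows is then a matter of calling the helper with a different k for each new column.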