Assign increasing index per group in grouped dataframe in R

I have a dataframe/tibble that looks like this:
# A tibble: 15 x 2
# Groups: id [3]
id date
<int> <date>
1 1 1998-02-13
2 1 1998-02-14
3 1 1998-02-15
4 1 1998-02-16
5 1 1998-02-17
6 2 1998-02-13
7 2 1998-02-14
8 2 1998-02-15
9 2 1998-02-16
10 2 1998-02-17
11 3 1998-02-13
12 3 1998-02-14
13 3 1998-02-15
14 3 1998-02-16
15 3 1998-02-17
I would like to add a variable "days_before_event" that counts from 1 to 5 within each group (this should not be hardcoded, but should instead follow the number of elements per group). I thought about doing something like this:
df_long %>% mutate(days_before_event = 1:nrow(.))
where nrow(.) should be the number of rows per group. This does not work and shows the error
Error: Problem with `mutate()` input `days_before_event`.
x Input `days_before_event` can't be recycled to size 5.
i Input `days_before_event` is `1:nrow(.)`.
i Input `days_before_event` must be size 5 or 1, not 15.
i The error occurred in group 1: id = 1.
Any trick on how I can achieve this?

We can use row_number instead of 1:nrow
library(dplyr)
df_long %>%
mutate(days_before_event = row_number())
Or using base R, use ave to group over the 'id' and get the sequence (seq_along)
df_long$days_before_event <- with(df_long, ave(seq_along(id),
id, FUN = seq_along))
The error occurs because the dataset is grouped (see the "Groups" attribute in the printed data). Inside mutate, `.` still refers to the entire dataset, so 1:nrow(.) produces a sequence from 1 to 15 rather than from 1 to the number of rows in the current group. mutate requires each group's result to have either the length of that group or length 1 (otherwise it would have to be wrapped in a list), hence the recycling error.
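As a self-contained check, the sketch below rebuilds a small grouped tibble like the one in the question and confirms that row_number() restarts within each id. The construction of df_long is an assumption, since the original data is only shown in printed form.

```r
library(dplyr)

# Assumed reconstruction of the question's data: 3 ids, 5 consecutive dates each
df_long <- tibble(
  id   = rep(1:3, each = 5),
  date = rep(seq(as.Date("1998-02-13"), by = "day", length.out = 5), times = 3)
) %>%
  group_by(id)

# row_number() is evaluated per group, so the counter restarts for every id
result <- df_long %>%
  mutate(days_before_event = row_number()) %>%
  ungroup()

result$days_before_event
```

If you want to stay close to the original 1:nrow(.) attempt, seq_len(n()) should give the same result, because n() is the size of the current group rather than of the whole dataset.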

Related

subset dataframe based on hierarchical preference of factor levels within column in R

I have a dataframe which I would like to subset based on a hierarchical preference of factor levels within a column. The following example shows that, per level of "ID", I want to select only one "method". Specifically, I want to keep "CACL" if possible; if "CACL" doesn't exist for this level, subset for "KCL"; and if that doesn't exist either, subset for "H2O".
ID<-c(1,1,1,2,2,3)
method<-c("CACL","KCL","H2O","H2O","KCL","H2O")
df1<-data.frame(ID,method)
ID method
1 1 CACL
2 1 KCL
3 1 H2O
4 2 H2O
5 2 KCL
6 3 H2O
ID<-c(1,2,3)
method<-c("CACL","KCL","H2O")
df2<-data.frame(ID,method)
ID method
1 1 CACL
2 2 KCL
3 3 H2O
I have done something similar subsetting by selecting a minimum number within a level, but am not able to adapt it. Am wondering whether I should use ifelse here too?
#if present, choose rows containing "number" 2 instead of 1 (this column contained only the two numbers 1 and 2)
library(dplyr)
new<-df %>%
group_by(col1,col2,col3) %>%
summarize(number = ifelse(any(number > 1), min(number[number>1]),1))
dfnew<-merge(new,df,by=c("colxyz","number"),all.x=T)
You can use order with match and then simply !duplicated:
df1 <- df1[order(match(df1$method, c("CACL","KCL","H2O"))),]
df1[!duplicated(df1$ID),]
# ID method
#1 1 CACL
#5 2 KCL
#6 3 H2O
#Variant not changing df1
i <- order(match(df1$method, c("CACL","KCL","H2O")))
df1[i[!duplicated(df1$ID[i])],]
An option using dplyr:
df1 %>%
mutate(preference = match(method, c("CACL","KCL","H2O"))) %>%
group_by(ID) %>%
filter(preference == min(preference)) %>%
select(-preference)
# A tibble: 3 x 2
# Groups: ID [3]
ID method
<dbl> <fct>
1 1 CACL
2 2 KCL
3 3 H2O
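The same "best rank within each ID" idea can also be sketched in base R without reordering the data. This is only an illustration: pref and pref_rank are names introduced here, and the sketch assumes every method appears in pref (a method missing from it would get an NA rank).

```r
df1 <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 3),
  method = c("CACL", "KCL", "H2O", "H2O", "KCL", "H2O")
)

# Preference order, most preferred first (hypothetical helper vector)
pref <- c("CACL", "KCL", "H2O")

# Position of each row's method in the preference order (1 = most preferred)
pref_rank <- match(df1$method, pref)

# Keep the rows whose rank equals the smallest rank found within their ID
best <- df1[pref_rank == ave(pref_rank, df1$ID, FUN = min), ]
best
```

Unlike the order/duplicated approach, this keeps the rows in their original order, and it would keep ties if the same method appeared twice for one ID.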

How to create new dataframe by repeating 1 column and sequentially repeating a 2nd column of original dataframe?

I'm trying to create a new dataframe by repeating the values in column 1 of the original df and pairing them with repeated values from column 2 of the original df. However, the values should repeat in a different manner for each column. For example, values from column 1 should repeat as 1,2,3,1,2,3,1,2,3, whereas values from column 2 should repeat as 1,1,1,2,2,2,3,3,3.
#here is original df
df1<-data.frame(x=1:3, y=10:12)
#I've tried the following:
data.frame(x=df1$x,y=df1[,2])->df2
range<-1:3
data.frame(x=df1$x,y=df1$y[range,2])->df3
#I then tried this:
rep(df1$x,df1$y[l,2])->df4
#output either looks like this:
x y
1 1 10
2 2 11
3 3 12
#Or I receive an error message:
Error in df1$y[1, 2] : incorrect number of dimensions
#I expect data output to look like this:
x y
1 10
2 10
3 10
1 11
2 11
3 11
1 12
2 12
3 12
An option would be expand
library(tidyr)
expand(df1, x, y)
Or with expand.grid from base R
do.call(expand.grid, df1)
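The two repetition patterns described in the question map directly onto the times and each arguments of rep(). A minimal base R sketch, using the df1 from the question (n is just a helper name introduced here):

```r
df1 <- data.frame(x = 1:3, y = 10:12)
n <- nrow(df1)

df2 <- data.frame(
  x = rep(df1$x, times = n),  # whole vector repeated: 1,2,3,1,2,3,1,2,3
  y = rep(df1$y, each  = n)   # each element repeated: 10,10,10,11,11,11,12,12,12
)
df2
```

This produces the rows in the order shown in the expected output, with x cycling fastest.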

How to merge data to apply to all unique conditions of a column in second data set, even when not occurring

I am trying to insert new rows of data based on unique values of a column in my original data set. I have the following dummy data set:
sites<-c("10","10","11","11","12","12")
ID<-c("A","A","B","B","C","D")
value<-c("4","6","5","2","7","8")
dataframe<-data.frame(sites, ID, value)
sites<-c("10","10","11","11","12","12","13","14","15")
dataframe2<-data.frame(sites)
Producing:
sites ID value
10 A 4
10 A 6
11 B 5
11 B 2
12 C 7
12 D 8
sites
10
10
11
11
12
12
13
14
15
For each unique value in column ID, I would like each site number from the second data frame applied, and when there is no value I would like it to print 0.
So for example, ID A would have all sites from dataframe2 listed, and where there is no value (i.e. for sites 11, 12, 13, 14 and 15) I would like it to list 0 for value.
I have tried the following:
mergeddata<-merge(dataframe, dataframe2, by="sites", all.y=TRUE)
But that only adds the new sites at the bottom with NA's for each value other than site. I want dataframe2 to be applied for each unique value under column ID, so that each ID has an occurrence of all sites. I'm not sure what the best way to go about this would be, any help is much appreciated!
This could be a job for complete() from package tidyr. You can group your first dataset by ID and then use complete() to add rows for the site values from dataframe2 within each group.
This results in having at least one row for each site in each ID. I use the fill argument to add the 0 to value for the new rows (after converting value to numeric).
library(dplyr)
library(tidyr)
dataframe$value = as.numeric( as.character(dataframe$value) )
dataframe %>%
group_by(ID) %>%
complete(sites = dataframe2$sites, fill = list(value = 0) )
# A tibble: 26 x 3
# Groups: ID [4]
ID sites value
<fct> <chr> <dbl>
1 A 10 4
2 A 10 6
3 A 11 0
4 A 12 0
5 A 13 0
6 A 14 0
7 A 15 0
8 B 10 0
9 B 11 5
10 B 11 2
# ... with 16 more rows
Warning message:
Column `sites` joining factors with different levels, coercing to character vector
The warning message arises because sites is a factor with different levels in the two datasets, which complete() deals with by converting the two columns to character instead.
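If you would rather avoid the extra packages, the same result can be sketched in base R by building every ID/site pair with expand.grid() and left-joining the data onto it. This sketch assumes the columns are read as character/numeric (stringsAsFactors = FALSE), which also sidesteps the factor warning; grid and full are names introduced here for illustration.

```r
dataframe <- data.frame(
  sites = c("10", "10", "11", "11", "12", "12"),
  ID    = c("A", "A", "B", "B", "C", "D"),
  value = c(4, 6, 5, 2, 7, 8),
  stringsAsFactors = FALSE
)
dataframe2 <- data.frame(
  sites = c("10", "10", "11", "11", "12", "12", "13", "14", "15"),
  stringsAsFactors = FALSE
)

# Every ID paired with every distinct site from dataframe2
grid <- expand.grid(
  ID    = unique(dataframe$ID),
  sites = unique(dataframe2$sites),
  stringsAsFactors = FALSE
)

# Left join: all ID/site pairs are kept, unmatched pairs get NA in value
full <- merge(grid, dataframe, by = c("ID", "sites"), all.x = TRUE)

# Replace the NAs introduced by the join with 0
full$value[is.na(full$value)] <- 0
full
```

Note that the last step would also turn any genuine NA in the original value column into 0, so it only fits data where value has no missing entries to begin with.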

countif within R repeated across each row

I'm having trouble trying to replicate some of the countif function I'm familiar with in excel. I've got a data frame, and it has a large number of rows. I'm trying to take 2 variables (x & z) and do a countif of how many other variables within my dataframe match that. I figured out doing:
sum(mydataframe$x == mydataframe$x[1] & mydataframe$z == mydataframe$z[1])
This gives me the correct countif for x&z within the whole data set for the first row [1]. The problem is I've got to use that [1]. I've tried using the (with,...) command, but then I can no longer access the whole column.
I'd like to be able to do the count of x & z combination for each row within the data frame then have that output as a new vector that I can just add as another column. And I'd like this to go on for every row through to the end.
Hopefully this is pretty simple. I figure some combination of (with,..) or apply or something will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column that contains, for each row, the number of rows in the entire data frame whose x and z values equal the x and z values of that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to 'data.table' (setDT(dat)), group by 'x', 'z' and assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n:= .N, by = .(x,z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1
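For comparison, the sum(...) expression from the question can be applied to every row directly by looping over the row indices with sapply(). This is a literal translation rather than a recommended approach, using the same sample data:

```r
dat <- data.frame(x = c(1, 1, 2), z = c(3, 3, 3))

# For each row i, count the rows of the whole data frame sharing its x and z
dat$n <- sapply(
  seq_len(nrow(dat)),
  function(i) sum(dat$x == dat$x[i] & dat$z == dat$z[i])
)
dat$n
```

It compares every row against every other row, so the work grows quadratically with the number of rows; the grouped versions above should scale much better on a large data frame.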

working with data in tables in R

I'm a newbie at working with R. I've got some data with multiple observations (i.e., rows) per subject. Each subject has a unique identifier (ID) and has another variable of interest (X) which is constant across each observation. The number of observations per subject differs.
The data might look like this:
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I'd like to find some code that would:
a) Identify the number of observations per subject
b) Identify subjects with greater than a certain number of observations (e.g., >= 15 observations)
c) For subjects with greater than a certain number of observations, I'd like to manipulate the X value for each observation (e.g., I might want to subtract 1 from their X value, so I'd like to modify X for each observation to be X-1)
I might want to identify subjects with at least three observations and reduce their X value by 1. In the above, individuals #1 and #3 (ID) have at least three observations, and their X values--which are constant across all observations--are 3 and 8, respectively. I want to find code that would identify individuals #1 and #3 and then let me recode all of their X values into a different variable. Maybe I just want to subtract 1 from each X value. In that case, the code would then give me X values of (3-1=)2 for #1 and 7 for #3, but #2 would remain at X = 4.
Any suggestions appreciated, thanks!
You can use the aggregate function to do this.
a) Say your table is named temp. You can count the observations for each ID and X by passing length to aggregate (sum would add up the observation numbers instead of counting the rows):
tot <- aggregate(Observation ~ ID + X, temp, FUN = length)
The output will look like this:
ID X Observation
1 1 3 4
2 2 4 2
3 3 8 3
b) To see the IDs with at least a certain number of observations, subset the table tot:
vals <- tot$ID[tot$Observation >= 3]
Output is:
[1] 1 3
c) To change the values for the IDs found in (b), select the matching rows of tot by ID (not by position) and update X:
tot$X[tot$ID %in% vals] <- tot$X[tot$ID %in% vals] - 1
The final output for the table is
ID X Observation
1 1 2 4
2 2 4 2
3 3 7 3
To change the original table, subset it by the IDs you found:
temp$X[temp$ID %in% vals] <- temp$X[temp$ID %in% vals] - 1
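The three steps can also be sketched end to end on the question's sample data using table() and ave(). The names min_obs, n_obs, keep_ids and X_new are introduced here for illustration, and the threshold of 3 is taken from the question's worked example.

```r
temp <- data.frame(
  ID          = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Observation = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
  X           = c(3, 3, 3, 3, 4, 4, 8, 8, 8)
)

# a) number of observations per subject
counts <- table(temp$ID)

# Per-row version of the same count, aligned with the rows of temp
n_obs <- ave(temp$Observation, temp$ID, FUN = length)

# b) subjects with at least min_obs observations (illustrative threshold)
min_obs <- 3
keep_ids <- unique(temp$ID[n_obs >= min_obs])

# c) write the adjusted value to a new column, leaving X untouched
temp$X_new <- ifelse(n_obs >= min_obs, temp$X - 1, temp$X)
temp
```

Writing to a separate column keeps the original X available, which matches the wish to recode the values "into a different variable".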
a) Identify the number of observations per subject
you can use this code on each variable:
summary
