For i in loops in R

I have been really struggling to grasp a basic programming concept: the for loop. I typically deal with hierarchically structured data in which measurements repeat within levels of unique identifiers, like this:
ID Measure
1 2
1 3
1 3
2 4
2 1
...
Very often I need to create a new column that either aggregates within ID or produces a value for each row within each level of ID. For the former I use pretty basic functions from either base or dplyr, but for the latter case I'd like to get in the habit of writing for loops.
So for this example, I would like a column added to my hypothetical df such that the new column starts at 1 for each ID and increments by 1 with each subsequent row, until a new ID occurs.
So, this:
ID Measure NewVal
1 2 1
1 3 2
1 3 3
2 4 1
2 1 2
...
I would love to learn how to do this with a for loop, but if there are other ways, I would like to hear those too.

One way is to use the splitstackshape package. It has a function called getanID, which is your friend here. If your df is called mydf, you would do the following. Note that the result is a data.table; if necessary, you can convert it back to a data.frame.
library(splitstackshape)
getanID(mydf, "ID")
# ID Measure .id
#1: 1 2 1
#2: 1 3 2
#3: 1 3 3
#4: 2 4 1
#5: 2 1 2
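If you need a plain data.frame back (an optional extra step, sketched here with the same mydf), you can simply wrap the call:
as.data.frame(getanID(mydf, "ID"))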
DATA
mydf <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Measure = c(2L, 3L,
3L, 4L, 1L)), .Names = c("ID", "Measure"), class = "data.frame", row.names = c(NA,
-5L))

seq_along gives an increasing sequence starting at 1 with the same length as its input. tapply applies a function within each level of a grouping factor. Here we don't care what values are supplied, only their grouping, so you can apply the ID column to itself. (Note this assumes the rows are sorted by ID, since unlist concatenates the per-level results in level order.)
> d$NewVal <- unlist(tapply(d$ID, d$ID, FUN=seq_along))
> d
ID Measure NewVal
1 1 2 1
2 1 3 2
3 1 3 3
4 2 4 1
5 2 1 2

You could also use data.table to assign the sequence by reference.
library(data.table)
setDT(mydf)                         ## convert to data.table by reference
mydf[, NewVal := seq(.N), by = ID]  ## .N is the number of rows in each ID group
# ID Measure NewVal
# 1: 1 2 1
# 2: 1 3 2
# 3: 1 3 3
# 4: 2 4 1
# 5: 2 1 2
setDF(mydf) ## convert easily to data frame if you wish.

Or you could use ave. The advantage is that it returns the sequence in the same order as the original dataset, which is useful when the rows are not sorted by ID.
transform(df, NewVal=ave(ID, ID, FUN=seq_along))
# ID Measure NewVal
#1 1 2 1
#2 1 3 2
#3 1 3 3
#4 2 4 1
#5 2 1 2
For a more general case (for example, if the ID column is a factor):
transform(df, NewVal=ave(seq_along(ID), ID, FUN=seq_along))
Or, if the ID column is sorted:
df$NewVal <- sequence(tabulate(df$ID))
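Here tabulate(df$ID) counts the rows for each ID, and sequence then concatenates seq_len of each count:
tabulate(df$ID)            # 3 2 (rows per ID)
sequence(tabulate(df$ID))  # 1 2 3 1 2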
Or using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(NewVal = row_number())
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Measure = c(2L, 3L,
3L, 4L, 1L)), .Names = c("ID", "Measure"), class = "data.frame",
row.names = c(NA, -5L))

I'd recommend you don't use a for loop for this; it's not a good fit for one. You can do this pretty easily in plyr (or dplyr) if you prefer:
require(plyr)
x <- data.frame(cbind(rnorm(100), rnorm(100)))
x$ID <- sample(1:10, 100, replace = TRUE)
new_col <- function(x) {
  x <- x[order(x[, 1]), ]
  x$NewVal <- 1:nrow(x)
  return(x)
}
x <- ddply(.data = x, .var = "ID", .fun = new_col)
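That said, since the question explicitly asks about for loops, here is a minimal loop-based sketch (using the question's df; it assumes rows with the same ID are contiguous, and the grouped solutions above are faster and more idiomatic):
df$NewVal <- NA
counter <- 0
for (i in seq_len(nrow(df))) {
  # restart the counter whenever a new ID begins
  if (i == 1 || df$ID[i] != df$ID[i - 1]) counter <- 0
  counter <- counter + 1
  df$NewVal[i] <- counter
}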

Related

How do I merge multiple contingency tables into one using R?

I have multiple columns that I need to merge and return a contingency table counting each number.
Example of an ordinal data set:
df <- data.frame(ab = c(1, 2, 3, 4, 5),
                 ba = c(1, 3, 3, 3, 5))
ab ba
 1  1
 2  3
 3  3
 4  3
 5  5
I would like to be able to return a contingency table showing like this:
1 2 3 4 5
2 1 4 1 2
I've attempted examples featured on here for similar issues, but I get sums returned instead of counts:
library(plyr)
colSums(rbind.fill(data.frame(t(unclass(df$ab))),
                   data.frame(t(unclass(df$ba)))),
        na.rm = TRUE)
Any help is greatly appreciated
We unlist the data.frame into a vector and apply table in base R
table(unlist(df))
# 1 2 3 4 5
# 2 1 4 1 2
Or with tidyverse, reshaping the data into 'long' format with pivot_longer and getting the count:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  count(value)
data
df <- structure(list(ab = 1:5, ba = c(1L, 3L, 3L, 3L, 5L)),
class = "data.frame", row.names = c(NA,
-5L))

Is there an R function to count strings by re-arranging the table [duplicate]

I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
      fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
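If you need the result as a data frame rather than a table object (an optional extra step, not part of the original answer), one way is as.data.frame.matrix, which keeps the wide shape:
as.data.frame.matrix(table(data$id, data$week))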
You don't need ddply for this. dcast from reshape2 is sufficient:
dat <- data.frame(
  id = c(rep(1, 4), 2),
  week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit: For a base R solution (other than table, as posted by Joshua Uhlrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits' are), so with a large number of groups it will be slow, and .parallel = TRUE will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be:
library(dplyr)
library(tidyr)
df %>%
  count(id, week) %>%
  pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
  # spread(week, n, fill = 0)  # in older versions of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider:
tidyr::pivot_wider(df, names_from = week,
                   values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor:
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))

calculate timeline for different subjects in dataframe

I have data like
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/1/01 3
I want to get R to work out the number of days since the first sample for each subject, e.g.:
Subject days
1 0
1 2
1 9
2 0
2 3
How can I do this? I have converted the dates using lubridate.
Something like:
for (i in 1:nrow(data)) {
  if (data$date[i] != data$date[i - 1]) {
    data$timeline <- data$date[i] - data$date[i - 1]
  }
}
I get the error:
argument is of length 0 - I think the problem is the first row, where there is no preceding row?
I would use dplyr to do some grouping and data manipulation. Note that we first have to convert your date into something R will recognize as a date.
library(dplyr)
dat$Date <- as.Date(dat$date, '%d/%m/%y')
dat %>%
  group_by(subject) %>%
  mutate(days = Date - min(Date))
# subject date number Date days
# <int> <chr> <int> <date> <time>
# 1 1 1/2/01 4 2001-02-01 0
# 2 1 3/2/01 6 2001-02-03 2
# 3 1 10/2/01 7 2001-02-10 9
# 4 2 1/1/01 2 2001-01-01 0
# 5 2 4/3/01 3 2001-03-04 62
Here's the data:
dat <- structure(list(subject = c(1L, 1L, 1L, 2L, 2L), date = c("1/2/01",
"3/2/01", "10/2/01", "1/1/01", "4/3/01"), number = c(4L, 6L,
7L, 2L, 3L), Date = structure(c(11354, 11356, 11363, 11323, 11385
), class = "Date")), .Names = c("subject", "date", "number",
"Date"), row.names = c(NA, -5L), class = "data.frame")
Using the input shown in the Note below, convert the date column to Date class (assuming it is in the form dd/mm/yy) and then use ave to subtract the earliest date from all the dates for each subject. If the input is sorted as in the question, we could optionally use x[1] instead of min(x). No packages are used.
data$date <- as.Date(data$date, "%d/%m/%y")
diff1 <- function(x) x - min(x)
with(data, data.frame(subject, days = ave(as.numeric(date), subject, FUN = diff1)))
giving:
subject days
1 1 0
2 1 2
3 1 9
4 2 0
5 2 62
Note
The input used, in reproducible form, is:
Lines <- "
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/3/01 3"
data <- read.table(text = Lines, header = TRUE)
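For completeness, a corrected loop version of the original attempt (a sketch using the same data as in the Note; the grouped solutions above are preferable):
data$date <- as.Date(data$date, "%d/%m/%y")
data$days <- NA
for (i in seq_len(nrow(data))) {
  # days since this subject's earliest date
  first <- min(data$date[data$subject == data$subject[i]])
  data$days[i] <- as.numeric(data$date[i] - first)
}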
