Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that var_a took on the value 1 5 times in the dataset, 5 12 times, etc. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!

You could use readLines to read in the data file
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be the equivalent of the output of readLines
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split each line on spaces to separate the variable name from its value:count pairs
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1],x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
.list <- strsplit(x, ':')
unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
You can then get the variance of each variable using
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get a boxplot for each variable (boxplot accepts a list of numeric vectors directly)
boxplot(variables)
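Since the question also asks about quantiles, here is a minimal sketch of the same expand-then-summarise idea wrapped in a function. The helper name expand_line is hypothetical, and the input lines are assumed to look exactly like the dummy data above (space-separated, value:count pairs).

```r
# Same dummy lines as above (assumed to match what readLines() would return)
lines <- c("var_a 1:5 5:12 7:9 9:14", "var_b 1:5 2:12 3:9 4:14")

# Hypothetical helper: turn one "name value:count ..." line into a numeric vector
expand_line <- function(line) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  pairs  <- strsplit(fields[-1], ":", fixed = TRUE)   # drop the name, split pairs
  values <- as.numeric(vapply(pairs, `[`, character(1), 1))
  counts <- as.integer(vapply(pairs, `[`, character(1), 2))
  rep(values, times = counts)                          # repeat each value by its count
}

expanded <- lapply(lines, expand_line)
names(expanded) <- vapply(strsplit(lines, " ", fixed = TRUE), `[`, character(1), 1)

# Quartiles for every variable: one column per variable, one row per probability
quartiles <- sapply(expanded, quantile, probs = c(0.25, 0.5, 0.75))
quartiles
```

Expanding is fine for modest counts; with very large counts the expanded vectors can get big, in which case working from the counts directly may be preferable.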

Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
"var_b 1:7 4:9 3:11 2:10",
"var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by=2)],
temp_1[[x]][seq(2, length(temp_1[[x]]), by=2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3
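If expanding the data is undesirable, the mean and variance can also be computed straight from the value/count pairs. This is only a sketch using the var_a numbers from the sample data; the variable names are mine, and the n - 1 denominator is chosen to match what var() returns.

```r
# Values and counts for var_a from the sample data above
values <- c(1, 5, 7, 9)
counts <- c(5, 12, 9, 14)

n      <- sum(counts)                                   # total number of observations
w_mean <- sum(values * counts) / n                      # count-weighted mean
w_var  <- sum(counts * (values - w_mean)^2) / (n - 1)   # sample variance

# Cross-check against the expanded vector
expanded <- rep(values, times = counts)
all.equal(w_mean, mean(expanded))
all.equal(w_var, var(expanded))
```

Both checks should return TRUE, since the weighted formulas are just the usual ones with identical observations grouped together.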

Related

How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that will identify rows using the variable x, and give rows missing x a unique id. In other words, I should have the same id for rows one and two, rows three and four, while rows 5-8 should have their own unique ids. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
set.seed(x)
res <- character(n)
for(i in seq(n)){
res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
}
res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
mutate(my_id = id_function(1234, nrow(.)),
my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS, what am I doing wrong? Is there an easier approach to solving this problem?
I guess rowwise() would do the trick:
data %>%
rowwise() %>%
mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr's map functions can be used for non-vectorized functions. The following gives a similar result. map2_chr passes the two arguments expected by your id_function and returns a character vector (plain map2 would return a list column).
library(tidyverse)
data %>%
mutate(my_id = map2_chr(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
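If you would rather stay in base R, Vectorize() is another way to make the function loop over x. This is a sketch, not the only option; id_vec is a name I made up, and the data frame is the one from the question. Note that id_function calls set.seed(), so it changes the global RNG state as a side effect.

```r
# id_function as defined in the question
id_function <- function(x, n) {
  set.seed(x)
  res <- character(n)
  for (i in seq(n)) {
    res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse = "")
  }
  res
}

data <- data.frame(x = c(1, 1, 2, 2, 3, 4, 5, 6), y = 1:8)

# Vectorize() wraps the function so it is called once per element of x;
# n is held fixed and passed through unchanged
id_vec <- Vectorize(id_function, vectorize.args = "x", USE.NAMES = FALSE)
data$my_id <- id_vec(data$x, 1)
data
```

Rows that share a value of x get the same id because they share a seed.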

How to split a dataframe into chunks when a particular column value occurs (R)? [duplicate]

I am trying to split a dataframe into chunks, based on a particular value in a column (rather than a grouping value), so every time the column matches this value, it should chunk the dataframe. For example, with dataframe x:
f1 f2
3 0
4 1
5 2
6 0
7 1
8 2
9 3
How would I split x into a list of data frames, where a new chunk starts every time "f2"==0:
[1]
f1 f2
3 0
4 1
5 2
[2]
f1 f2
6 0
7 1
8 2
9 3
I have tried
split(x, x$f2 == 0)
which just creates a list of two elements, one where x$f2 == 0 is FALSE and one where x$f2 == 0 is TRUE.
I have also tried to use apply() as in
mm <- apply(x, function(x) split(x$f2 == 0))
but I get the error "Error in match.fun(FUN) : argument "FUN" is missing, with no default"
Code to make a simple dataframe as above:
f1 <- c(3,4,5,6,7,8,9)
f2 <- c(0,1,2,0,1,2,3)
x <- data.frame(f1,f2)
Using base R's split with for example cumsum like this would be a way:
split(x, cumsum(x$f2 == 0))
Output
# $`1`
# f1 f2
# 1 3 0
# 2 4 1
# 3 5 2
#
# $`2`
# f1 f2
# 4 6 0
# 5 7 1
# 6 8 2
# 7 9 3
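To see why this works, it can help to look at the grouping vector on its own. A small sketch with the f2 column from the question (is_start and chunk_id are just illustrative names):

```r
f2 <- c(0, 1, 2, 0, 1, 2, 3)

is_start <- f2 == 0          # TRUE on every row where a new chunk begins
chunk_id <- cumsum(is_start) # running count of starts = chunk number per row
chunk_id
# [1] 1 1 1 2 2 2 2
```

Each TRUE bumps the running total by one, so all rows up to the next 0 share the same number, and split() groups on that. If the first row were not a 0, the leading rows would end up in a chunk numbered 0.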
With dplyr, you can do (basically the same idea as the one by @jogo):
x %>%
group_split(cumsum(f2 == 0), .keep = FALSE)
[[1]]
# A tibble: 3 x 2
f1 f2
<int> <int>
1 3 0
2 4 1
3 5 2
[[2]]
# A tibble: 4 x 2
f1 f2
<int> <int>
1 6 0
2 7 1
3 8 2
4 9 3

Need help concatenating column names

I am generating 5 different predictions and adding those predictions to an existing data frame. My code is:
for (j in i) {
…
actual.predicted <- data.frame(test_data, predicted)
}
I am trying to concatenate words together to create new column names, in the loop. Specifically, I have a column named “predicted” and I am generating predictions in each iteration of the loop. So, in the first iteration, I want the new column name to be “predicted.1” and for the second iteration, the new column name should be “predicted.2” and so on.
Any thoughts would be greatly appreciated.
You may not even need to use a loop here, but assuming you do, one pattern which might work well here would be to use a list:
results <- list()
for (j in i) {
# do something involving j
name <- paste0("predicted.", j)
results[[name]] <- data.frame(test_data, predicted)
}
One option is to set the names after assigning new columns
actual.predicted <- data.frame(orig_col = sample(10))
for (j in 1:5){
new_col = sample(10)
actual.predicted <- cbind(actual.predicted, new_col)
names(actual.predicted)[length(actual.predicted)] <- paste0('predicted.',j)
}
actual.predicted
# orig_col predicted.1 predicted.2 predicted.3 predicted.4 predicted.5
# 1 1 4 4 9 1 5
# 2 10 2 3 7 5 9
# 3 8 6 5 4 2 3
# 4 5 9 9 10 7 7
# 5 2 1 10 8 3 10
# 6 9 7 6 6 8 6
# 7 7 8 7 2 4 2
# 8 3 3 1 1 6 8
# 9 6 10 2 3 9 4
# 10 4 5 8 5 10 1
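A slightly shorter variant is to build the column name first and assign with [[ ]], which adds and names the column in one step. This is a sketch only: rnorm(10) is a stand-in for whatever your model actually predicts, and the orig_col data frame is made up for illustration.

```r
set.seed(1)  # only so the illustration is reproducible
actual.predicted <- data.frame(orig_col = 1:10)

for (j in 1:5) {
  predicted <- rnorm(10)                       # placeholder for real predictions
  col_name  <- paste0("predicted.", j)         # "predicted.1", "predicted.2", ...
  actual.predicted[[col_name]] <- predicted    # add the column under that name
}

names(actual.predicted)
```

This avoids the separate cbind() and names()<- steps, and the column name is always tied to the iteration that produced it.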

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector with fewer elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
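Applied to the exact example in the question, a minimal sketch would be to recycle explicitly to nrow(df), so the code keeps working whatever the number of rows is:

```r
df <- data.frame(x = 1:5)

# Recycle 1:4 explicitly to the number of rows in df
df$z <- rep_len(1:4, nrow(df))
df$z
# [1] 1 2 3 4 1
```

Being explicit about the target length also documents that the recycling is intentional.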
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, you have probably used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

R - Subset dataframe to include only subjects with more than 1 record

I'd like to subset a dataframe to include all records for subjects that have >1 record, and exclude those subjects with only 1 record.
Let's take the following dataframe;
mydata <- data.frame(subject_id = factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10)),
variable = rnorm(15))
The code below gives me the subjects with >1 record using duplicated();
duplicates <- mydata[duplicated(mydata$subject_id),]$subject_id
But I want to retain in my subset all records for each subject with >1 record, so I tried;
mydata[mydata$subject_id==as.factor(duplicates),]
Which does not return the result I'm expecting.
Any ideas?
A data.table solution
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
library(data.table)
setDT(mydata)[, .SD[.N > 1], by = subject_id] # Thanks, David.
# subject_id variable
# 1: 4 -1.3325937
# 2: 4 -0.4465668
# 3: 5 0.5696061
# 4: 5 -2.8897176
# 5: 6 -0.8690183
# 6: 6 -0.4617027
# 7: 9 -0.1503822
# 8: 9 -0.6281268
# 9: 9 1.3232209
A simple alternative is to use dplyr:
library(dplyr)
dfr <- data.frame(a=sample(1:2,10,rep=T), b=sample(1:5,10, rep=T))
dfr <- group_by(dfr, b)
dfr
# Source: local data frame [10 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 5
# 4 2 1
# 5 1 2
# 6 1 3
# 7 2 1
# 8 2 4
# 9 1 4
# 10 2 4
filter(dfr, n() > 1)
# Source: local data frame [8 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 1
# 4 1 2
# 5 2 1
# 6 2 4
# 7 1 4
# 8 2 4
Here you go (I changed your variable to var <- rnorm(15)):
set.seed(11)
subject_id<-as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
var<-rnorm(15)
mydata<-as.data.frame(cbind(subject_id,var))
x1 <- c(names(table(mydata$subject_id)[table(mydata$subject_id) > 1]))
x2 <- which(mydata$subject_id %in% x1)
mydata[x2,]
subject_id var
4 4 0.3951076
5 4 -2.4129058
6 5 -1.3309979
7 5 -1.7354382
8 6 0.4020871
9 6 0.4628287
12 9 -2.1744466
13 9 0.4857337
14 9 1.0245632
Try:
> mydata[mydata$subject_id %in% mydata[duplicated(mydata$subject_id),]$subject_id,]
subject_id variable
4 4 -1.3325937
5 4 -0.4465668
6 5 0.5696061
7 5 -2.8897176
8 6 -0.8690183
9 6 -0.4617027
12 9 -0.1503822
13 9 -0.6281268
14 9 1.3232209
I had to edit your data frame a little bit:
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
Now to get all the rows for subjects that appear more than once:
mydata[duplicated(mydata$subject_id)
| duplicated(mydata$subject_id, fromLast = TRUE), ]
# subject_id variable
# 4 4 -1.3325937
# 5 4 -0.4465668
# 6 5 0.5696061
# 7 5 -2.8897176
# 8 6 -0.8690183
# 9 6 -0.4617027
# 12 9 -0.1503822
# 13 9 -0.6281268
# 14 9 1.3232209
Edit: this would also work, using your duplicates vector:
mydata[mydata$subject_id %in% duplicates, ]
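Another base R option, offered as a sketch, is to count the records per subject with ave() and filter on that count. The variable column here is just 1:15 so the example is deterministic; n_records and repeated are illustrative names.

```r
mydata <- data.frame(
  subject_id = factor(c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 9, 9, 10)),
  variable   = seq_len(15)   # placeholder values instead of rnorm(15)
)

# For every row, the number of rows that share its subject_id
n_records <- ave(seq_along(mydata$subject_id), mydata$subject_id, FUN = length)

# Keep all rows belonging to subjects with more than one record
repeated <- mydata[n_records > 1, ]
repeated
```

The per-row count is sometimes handy to keep as a column in its own right, for example to filter on "at least 3 records" later.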
