Display an evenly-spaced sample of rows from a data frame in R

Instead of looking at the first n rows of a data frame, as head(mydf) does, or the last n as tail(mydf) does, it occurs to me that I would often rather see n evenly-spaced rows, including the first and the last row. For example, if a data frame had 501 rows, this hypothetical function would display rows 1, 101, 201, 301, 401, and 501, assuming that 6 is the default number, as it is for head() and tail().
Is there a built-in function or a function in some package that does this, and if not, what would be the best way to implement it?
For example, for the data frame mydf <- data.frame(name=letters, value=101:126), I would want the output of an alternative to head() called myview() to be something like:
> myview(mydf)
name value
1 a 101
6 f 106
11 k 111
16 p 116
21 u 121
26 z 126

You can do this directly with seq():
looksee <- function(df, n = 6) df[seq(1, nrow(df), length.out = n),]
looksee(mydf)
# name value
#1 a 101
#6 f 106
#11 k 111
#16 p 116
#21 u 121
#26 z 126
looksee(mydf, 10)
# name value
#1 a 101
#3 c 103
#6 f 106
#9 i 109
#12 l 112
#14 n 114
#17 q 117
#20 t 120
#23 w 123
#26 z 126
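One detail worth knowing about the seq() approach: the generated positions are usually fractional, and R truncates fractional indices toward zero when subsetting. The following sketch (variable names are illustrative only) compares truncation with explicit rounding for the 26-row, n = 10 case:

```r
# Positions produced by seq() for 26 rows and n = 10; most are fractional
idx <- seq(1, 26, length.out = 10)

# Truncation toward zero: this is what df[idx, ] effectively uses
trunc_idx <- as.integer(idx)

# Alternative: pick the nearest row instead of truncating
round_idx <- round(idx)

trunc_idx
round_idx
```

Both choices keep the first and last rows; they differ only in which interior rows are picked.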

This is my attempt at an implementation, but it is probably not very robust compared to head(): for one thing, it only works for objects that nrow() works for.
looksee <- function(df, n = 6) {
  # n evenly spaced probabilities between 0 and 1
  q <- seq(0, 1, length.out = n)
  nr <- nrow(df)
  # map the probabilities onto row numbers 1..nr and round to whole rows
  rows <- round(quantile(1:nr, probs = q))
  return(df[rows, ])
}
Example usage:
> mydf <- data.frame(name=letters, value=101:126)
> looksee(mydf)
name value
1 a 101
6 f 106
11 k 111
16 p 116
21 u 121
26 z 126
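Regarding the robustness concern, here is a minimal sketch of a variant that also accepts plain vectors and inputs shorter than n. The name peek() is made up for illustration, and the handling of short inputs (return everything) is an assumption about the desired behaviour, not something head() itself prescribes:

```r
# Hypothetical helper: n evenly spaced rows (or elements), first and last included
peek <- function(x, n = 6L) {
  len <- NROW(x)                 # NROW() also works for vectors, unlike nrow()
  if (len == 0L) return(x)       # nothing to sample from
  k <- min(n, len)               # never ask for more rows than exist
  idx <- unique(round(seq(1, len, length.out = k)))
  if (is.null(dim(x))) x[idx] else x[idx, , drop = FALSE]
}

mydf <- data.frame(name = letters, value = 101:126)
peek(mydf)          # rows 1, 6, 11, 16, 21, 26
peek(1:601, 7)      # a plain vector works too
peek(letters[1:3])  # shorter than n: everything is returned
```

Using drop = FALSE keeps single-column data frames from collapsing to a vector.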

Related

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform a two-sample t-tests to compare the differences of a series of numeric variables between each level of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
  frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
  fit <- t.test(frml, alternative = "two.sided", data = df) # perform the t-test
  # hint: use str(fit) to figure out how to pull out each value you care about
  fit_df <- data.frame(
    dv = dv,
    f_mean = xxx,
    m_mean = xxx,
    t = xxx,
    df = xxx,
    p = xxx
  )
  fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).
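To make the extraction step concrete, here is one possible sketch of the whole function. It is not the only way to fill in the blanks above: it assumes the classification variable has exactly two levels that sort as "F" then "M", it uses reformulate() rather than paste0() to build the formula, and the toy data frame is made up purely for illustration.

```r
# Sketch only: assumes classVar has exactly two levels, sorted as "F" then "M"
myTtest <- function(dat, classVar, numVar) {
  rows <- lapply(numVar, function(v) {
    # reformulate() builds the formula  v ~ classVar  from character strings
    fit <- t.test(reformulate(classVar, response = v), data = dat)
    data.frame(
      Varname = v,
      F.mean  = as.numeric(fit$estimate[1]),  # mean in first level
      M.mean  = as.numeric(fit$estimate[2]),  # mean in second level
      t       = as.numeric(fit$statistic),
      df      = as.numeric(fit$parameter),
      p       = fit$p.value
    )
  })
  do.call(rbind, rows)  # stack the one-row data frames
}

# Made-up data for illustration (not the data set from the question)
set.seed(1)
toy <- data.frame(sex  = rep(c("F", "M"), each = 10),
                  age  = rnorm(20, 30, 5),
                  chol = rnorm(20, 180, 25))
res <- myTtest(toy, classVar = "sex", numVar = c("age", "chol"))
res
```

Collecting the rows in a list and binding once at the end avoids growing a data frame inside a loop, although for a handful of variables either approach is fine.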

Convert list of lists to single dataframe with first column filled by first value (for each list) in R

I have a list of lists, like so:
x <-list()
x[[1]] <- c('97', '342', '333')
x[[2]] <- c('97','555','556','742','888')
x[[3]] <- c ('100', '442', '443', '444', '445','446')
The first number in each list (97, 97, 100) refers to a node in a tree and the following numbers refer to traits associated with that node.
My goal is to create a dataframe that looks like this:
df= data.frame(node = c('97','97','97','97','97','97','100','100','100','100','100'),
trait = c('342','333','555','556','742','888','442','443','444','445','446'))
where each trait has its corresponding node.
I think the first thing I need to do is convert the list of lists into a single dataframe. I've tried doing so using:
do.call(rbind,x)
but that repeats the values in x[[1]] and x[[2]] to match the length of x[[3]]. I've also tried using:
dt_list <- map(x, as.data.table)
dt <- rbindlist(dt_list, fill = TRUE, idcol = T)
Which I think gets me closer, but I'm still unsure of how to assign the first node value to the corresponding trait values. I know this is probably a simple task but it's stumping me today!
Maybe you can try the code below
h <- sapply(x, `[`,1)
d <- lapply(x, `[`,-1)
df <- data.frame(node = rep(h,lengths(d)), trait = unlist(d))
such that
> df
node trait
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446
You can create a data frame with the first value from the vector in column 'node' and the rest of the values in column 'trait'. This strategy can be applied to all entries in the list using the map_df() function from purrr package, giving the output you describe.
library(purrr)
library(dplyr)
x %>%
map_df(., function(vec) data.frame(node = vec[1],
trait = vec[-1],
stringsAsFactors = F))
An option with base R is
stack(setNames(lapply(x, `[`, -1), sapply(x, `[`, 1)))[2:1]
# ind values
#1 97 342
#2 97 333
#3 97 555
#4 97 556
#5 97 742
#6 97 888
#7 100 442
#8 100 443
#9 100 444
#10 100 445
#11 100 446
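The stack() one-liner is dense, so here it is unpacked into steps as a sketch. The intermediate names (traits, nodes, named, long) are only for illustration:

```r
x <- list(c('97', '342', '333'),
          c('97', '555', '556', '742', '888'),
          c('100', '442', '443', '444', '445', '446'))

traits <- lapply(x, `[`, -1)     # everything except the first element
nodes  <- sapply(x, `[`, 1)      # the first element of each vector
named  <- setNames(traits, nodes) # list of trait vectors, named by node

# stack() turns a named list into two columns: values and ind (a factor)
long <- stack(named)

# reorder and rename to match the requested layout
res <- data.frame(node  = as.character(long$ind),
                  trait = as.character(long$values))
res
```

Duplicate names (two vectors for node 97) are not a problem here, since stack() simply repeats each name once per element.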
Another solution
library(tidyverse)
library(purrr)
node <- map(x, ~rep(.x[1], length(.x)-1)) %>% flatten_chr()
trait <- map(x, ~.x[2:length(.x)]) %>% flatten_chr()
out <- tibble(node, trait)
node trait
<chr> <chr>
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446

Average over rows pairs and paste the value based on condition

In R, I have a df such as:
a b c
1 124 70 aa
2 129 67 aa
3 139 71 aa
4 125 77 aa
5 125 82 aa
6 121 69 aa
7 135 68 bb
8 137 72 bb
9 137 78 bb
10 140 86 bb
I want to iterate along the rows within columns (a, b), computing the mean of each pair of consecutive rows, and write this mean to the same two rows of new columns (a_new, b_new) if the difference between the two rows is >= 12. Otherwise, just copy the old value. This behaviour should be restricted to groups as marked by another column (c), i.e. it should not happen if the two rows are from different groups.
In this example, it happens in row 3 (because in column a, the difference with the next (4th) row is 14) and in row 5 (because in column b, the difference with the next row is 13). However, it should not happen with row 6, because row 7 is in another c group.
Thus, resulting df would look like:
a b c a_new b_new
1 124 70 aa 124 70
2 129 67 aa 129 67
3 139 71 aa 132 71
4 125 77 aa 132 77
5 125 82 aa 125 75.5
6 121 69 aa 121 75.5
7 135 68 bb 135 68
8 137 72 bb 137 72
9 137 78 bb 137 78
10 140 86 bb 140 86
I've been struggling to do this for a while, figured out that perhaps lag function could be used, but no success. Help would be much appreciated (be it base R, or dplyr, or whatever)
Dput:
structure(list(a = c(124, 129, 139, 125, 125, 121, 135, 137,
137, 140), b = c(70, 67, 71, 77, 82, 69, 68, 72, 78, 86), c = c("aa",
"aa", "aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
We can write a function which works for one chunk.
apply_fun <- function(x) {
inds <- which(abs(diff(x)) >= 12)
if(length(inds))
x[sort(c(inds, inds + 1))] <- c(sapply(inds, function(i)
rep(mean(x[c(i, i + 1)]), 2)))
return(x)
}
and then apply it for multiple columns by group.
library(dplyr)
df %>% group_by(c) %>% mutate_at(vars(a, b), list(new = apply_fun))
# a b c a_new b_new
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 124 70 aa 124 70
# 2 129 67 aa 129 67
# 3 139 71 aa 132 71
# 4 125 77 aa 132 77
# 5 125 82 aa 125 75.5
# 6 121 69 aa 121 75.5
# 7 135 68 bb 135 68
# 8 137 72 bb 137 72
# 9 137 78 bb 137 78
#10 140 86 bb 140 86
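Since the question allows base R as well, the same per-group application can be sketched without dplyr using ave(), which applies a function within the groups defined by its grouping argument. The data frame is rebuilt here from the dput output so that the block is self-contained:

```r
# Same helper as above: average consecutive pairs that differ by >= 12
apply_fun <- function(x) {
  inds <- which(abs(diff(x)) >= 12)
  if (length(inds))
    x[sort(c(inds, inds + 1))] <- c(sapply(inds, function(i)
      rep(mean(x[c(i, i + 1)]), 2)))
  return(x)
}

df <- data.frame(
  a = c(124, 129, 139, 125, 125, 121, 135, 137, 137, 140),
  b = c(70, 67, 71, 77, 82, 69, 68, 72, 78, 86),
  c = c("aa", "aa", "aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb")
)

# ave() runs apply_fun separately within each level of df$c
df$a_new <- ave(df$a, df$c, FUN = apply_fun)
df$b_new <- ave(df$b, df$c, FUN = apply_fun)
df
```

Because ave() returns values in the original row order, this also works when the groups are not stored contiguously.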
As I understand it, the procedure commented in the code below should be applied to each group defined by the indicator column "c":
pairAverage <- function(x) {
  # x should be a numeric vector of length > 1
  if (is.vector(x) && is.numeric(x) && length(x) > 1) {
    # copy data to an aux vector
    aux <- x
    # get differences of lag 1
    dh <- diff(x, 1)
    # get means of consecutive pairs
    med <- x[2:length(x)] - dh/2
    # get positions (index) of abs(differences) >= 12
    idx <- which(abs(dh) >= 12)
    # need 2 reps of each mean to replace consecutive values of x
    valToRepl <- med[sort(rep(idx, 2))]
    # ordered indexes pairs of consecutive elements of x to be replaced
    idxToRepl <- sort(c(idx, idx + 1))
    # replace pairs of values
    aux[idxToRepl] <- valToRepl
    return(aux)
  } else {
    # do nothing
    warning("parameter x should be a numeric vector of length > 1")
    return(NULL)
  }
}
pairAverageByGroups <- function(x, gr) {
  if (is.vector(x) && is.numeric(x) && length(x) == length(gr)) {
    x.ls <- split(x, as.factor(gr))
    output <- unlist(lapply(x.ls, pairAverage))
    names(output) <- NULL
    output
  } else {
    # do nothing
    warning("parameter x should be a numeric vector with the same length as gr")
    return(NULL)
  }
}
pairAverageByGroups(df$a, df$c)
[1] 124 129 132 132 125 121 135 137 137 140

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from same data frame by matching the first 5 letters of the column names with each other and if they are equal then subset it and store it in a new variable.
Here is a small illustration of the output I need.
Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select()
library(tidyverse)
map(v, ~select(eatable, starts_with(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
  v <- unique(substr(names(df), 1, l))
  lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
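As a further base R sketch of the same idea, split.default() can split the columns of a data frame by a grouping vector, here the first five letters of each column name. The variable names prefix and groups are only illustrative:

```r
eatable <- data.frame(
  fruits_area          = c(12, 33, 660),
  fruits_production    = c(100, 250, 510),
  vegetables_area      = c(26, 40, 43),
  vegetable_production = c(324, 580, 581)
)

# first five letters of every column name
prefix <- substr(names(eatable), 1, 5)

# split.default() treats the data frame as a list of columns and
# returns one data frame per distinct prefix, named by that prefix
groups <- split.default(eatable, prefix)
groups$fruit
groups$veget
```

The result is a named list, so each subset can be pulled out by its prefix rather than by position.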
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you're subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that the rows belonging to the 0 group come first, followed by the rows belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? That is, a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group again, and so on.
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there are unbalanced numbers of 0's and 1's, it will leave the extras at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)
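The same interleaving idea can be sketched in base R: number the rows within each group, then order by that within-group counter first and by the group second. The small data frame below is made up for illustration, and rank_in_group is just an illustrative name:

```r
# Made-up example: four 0's and six 1's, with an id column to track rows
df <- data.frame(id = 1:10,
                 addme = c(0, 0, 0, 1, 1, 1, 1, 0, 1, 1))

# 1, 2, 3, ... within each value of addme
rank_in_group <- ave(seq_along(df$addme), df$addme, FUN = seq_along)

# first 0 and first 1, then second 0 and second 1, and so on;
# leftover rows from the larger group end up at the bottom
out <- df[order(rank_in_group, df$addme), ]
out
```

Unlike the rbind() approach, this produces no recycling warning when the groups have unequal sizes.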
