Creating a dataframe grouping observations according to labels - r

I have a vector x of categorical values and a vector y of numeric values, both of the same length.
I need to create a data frame in which the numeric observations in y are grouped by their categorical label in x, so the end result would look something like:
x obs1 obs2 obs3
a 1 3 5
b 6 7 8
c 3 4 6
Now, both aggregate and tapply require a FUN argument, but I don't want to apply any operation to the values.
x= {random sampling from letters of the alphabet}
y= {random numbers}

Remember, everything is a function in R. So things like c() are just function calls.
x <- rep(letters[1:3], each=3)
y <- c(1, 3, 5, 6, 7, 8, 3, 4, 6)
foo <- tapply(y, x, c)
# > foo
# $a
# [1] 1 3 5
# $b
# [1] 6 7 8
# $c
# [1] 3 4 6
Then you can use this silly pattern to get the data.frame you're looking for:
do.call(rbind, foo)
# [,1] [,2] [,3]
# a 1 3 5
# b 6 7 8
# c 3 4 6
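If you want an actual data.frame with an x column and obs1/obs2/obs3 headers as in the question, a minimal sketch building on the matrix above:
m <- do.call(rbind, foo)
colnames(m) <- paste0("obs", seq_len(ncol(m)))
data.frame(x = rownames(m), m, row.names = NULL)
#   x obs1 obs2 obs3
# 1 a    1    3    5
# 2 b    6    7    8
# 3 c    3    4    6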

I am not clear about something from your example: is it possible for there to be different numbers of y-values for each category in x? For example, would you consider basic data like this:
> x <- c(rep(c("a", "b", "c"), 3), "c", "c")
> y <- sample(1:20, 11)
> df <- data.frame(x, y)
> df
x y
1 a 16
2 b 4
3 c 9
4 a 2
5 b 12
6 c 17
7 a 7
8 b 10
9 c 11
10 c 1
11 c 8
Here there are more values for category c. This is not entirely what you are looking for, but it might be a start:
> library(reshape2)
> dcast(df, x ~ y)
Using y as value column: use value.var to override.
x 1 2 4 7 8 9 10 11 12 16 17
1 a NA 2 NA 7 NA NA NA NA NA 16 NA
2 b NA NA 4 NA NA NA 10 NA 12 NA NA
3 c 1 NA NA NA 8 9 NA 11 NA NA 17
The values for each of the categories appear in the right rows... the NAs are a nuisance though. How would you want the data to appear in this case? Something like
1 a 2 7 16
2 b 4 10 12
3 c 1 8 9 11 17
This will not work, of course, because each row must have the same number of columns, so you would end up with NAs for the last two elements in the top two rows.
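If you did want that rectangular, NA-padded layout anyway, here is a minimal sketch (the object name groups is just illustrative):
groups <- split(y, x)
# pad each group with NA to the length of the longest group, then bind the rows
do.call(rbind, lapply(groups, `length<-`, max(lengths(groups))))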
However, I suspect that a list would probably be the best solution in this case anyway, in which case, consider this:
> dl <- split(y, x)
> dl[["a"]]
[1] 16 2 7
> dl$b
[1] 4 12 10
> dl[["c"]]
[1] 9 17 11 1 8
You can then operate on the elements of this list. As with all things R, there are a variety of ways to do this. For example, to get the output as a list:
> lapply(dl, sum)
$a
[1] 25
$b
[1] 26
$c
[1] 46
Or with output as a vector
> sapply(dl, sum)
a b c
25 26 46
Or, alternatively, to get the output as a data frame:
> library(plyr)
> ldply(dl, sum)
.id V1
1 a 25
2 b 26
3 c 46
These mechanisms afford a far greater degree of generality than functions like rowSums(), since you can apply essentially arbitrary functions to each of the elements in the original list.
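For instance, a sketch applying an arbitrary per-group summary (the choice of statistics here is just illustrative):
> sapply(dl, function(v) c(n = length(v), min = min(v), max = max(v)))
     a  b  c
n    3  3  5
min  2  4  1
max 16 12 17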

Related

Subsetting a column in one data frame using two columns in another data frame in R

I have searched SO for a similar problem but couldn't find one.
I have two data frames. I want to subset one column in one data frame using two columns in another data frame.
The data frames are as follows.
df1 <- data.frame(x = c(22,23,22,34,21),
y = c(1,4,2,3,2))
df1
x y
1 22 1
2 23 4
3 22 2
4 34 3
5 21 2
df2 <- data.frame(a = c("John", "Matt", "foo","boo"),
b = c(4, NA, NA,2),
c = c(3, NA, 3, 3))
df2
a b c
1 John 4 3
2 Matt NA NA
3 foo NA 3
4 boo 2 3
I want to subset the column df1$y using columns b and c from data frame df2, with a vectorized operation.
The output should be in list form, as follows:
df1
df1[1]
x y
2 23 4
4 34 3
df1[2]
df1[3]
x y
4 34 3
df1[4]
x y
3 22 2
4 34 3
5 21 2
You can try something like this:
dfnew <- list()
for (i in 1:nrow(df2)) {
  dfnew[[i]] <- df1[which(df1$y %in% df2[i, 2:3]), ]
}
Result:
dfnew
[[1]]
x y
2 23 4
4 34 3
[[2]]
[1] x y
<0 rows> (or 0-length row.names)
[[3]]
x y
4 34 3
[[4]]
x y
3 22 2
4 34 3
5 21 2
We can use lapply
lapply(split(df2[-1], as.character(df2$a)), function(x) df1[df1$y %in% unlist(x),])
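Equivalently, a sketch that iterates over the rows of df2 and keeps the results in their original row order (setting the names from df2$a is just for readability):
setNames(lapply(seq_len(nrow(df2)),
                function(i) df1[df1$y %in% unlist(df2[i, c("b", "c")]), ]),
         df2$a)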

How to do iterations in R?

I'm operating with a dataset that contains the values of same variables at different points in time. In the example below I have the values of variables a and b at time points 1 and 2.
> set.seed(1)
> data <- data.frame(matrix(sample(16), ncol = 4))
> names(data) <- paste(rep(c("a", "b"), each = 2), 1:2, sep = "")
> data
a1 a2 b1 b2
1 5 3 14 13
2 6 10 1 8
3 9 11 2 4
4 12 15 7 16
Now, suppose I want to calculate a new variable for both time points so that it would contain the sum of a and b (instead of the NAs as in example below). Since my actual dataset contains about 15 different variables and 10 time points (so 150 columns), I want to automate this calculation of 10 new variables.
> data[, paste("ab", 1:2, sep = "")] <- NA
> data
a1 a2 b1 b2 ab1 ab2
1 5 3 14 13 NA NA
2 6 10 1 8 NA NA
3 9 11 2 4 NA NA
4 12 15 7 16 NA NA
I've previously used Stata where I could create a simple 'foreach' loop to do this. Something like below.
foreach t of numlist 1/2 {
generate ab`t' = a`t' + b`t'
}
But I've read that explicit loops are often discouraged in R, and in any case I have no idea how to loop over variable names like that in R.
So what would be the correct solution for my problem in R?
This will replicate the same foreach loop you used in Stata.
for (i in 1:2) {
  data[, paste("ab", i, sep = "")] <-
    data[, paste("a", i, sep = "")] + data[, paste("b", i, sep = "")]
}
The output looks like this:
> data
  a1 a2 b1 b2 ab1 ab2
1  5  3 14 13  19  16
2  6 10  1  8   7  18
3  9 11  2  4  11  15
4 12 15  7 16  19  31
To do this the R way:
make use of native iteration via an *apply function,
use the built-in rowSums (as in @Sotos's answer),
make use of assignment into the data.frame, that is, the `[<-` replacement function.
All together:
data[paste0('ab', 1:2)] <- sapply(1:2, function(i)
  rowSums(data[paste0(c('a', 'b'), i)]))
data
# a1 a2 b1 b2 ab1 ab2
# 1 5 3 14 13 19 16
# 2 6 10 1 8 7 18
# 3 9 11 2 4 11 15
# 4 12 15 7 16 19 31
P.S. In a program, use vapply instead; you'll need to provide an additional argument specifying the shape of the output, but it's safer and sometimes faster.
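A sketch of the vapply version (the template argument says each result is a numeric vector with one value per row of data):
data[paste0('ab', 1:2)] <- vapply(1:2,
  function(i) rowSums(data[paste0(c('a', 'b'), i)]),
  numeric(nrow(data)))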
You can do without iteration:
data$ab1 <- data$a1 + data$b1
data$ab2 <- data$a2 + data$b2
or
data <- transform(data, ab1=a1+b1, ab2=a2+b2)
BTW:
It is better not to name an object data because data= is often a parameter in functions.
Here is one way to do it. We iterate over the unique numeric suffixes of the column names and calculate rowSums over the columns whose suffix matches.
sapply(unique(sub('\\D', '', names(data))),
       function(i) rowSums(data[, grepl(i, sub('\\D', '', names(data)))]))
#       1  2
# [1,] 19 16
# [2,]  7 18
# [3,] 11 15
# [4,] 19 31

Create multiple data frames from one based off values with a for loop

I have a large data frame that I would like to split into smaller subset data frames using a for loop. I want the new data frames to be based on the values in a column of the large/parent data frame. Here is an example:
x<- 1:20
y <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","C","C","C")
df <- as.data.frame(cbind(x,y))
OK, now I want three data frames: each will have columns x and y, the first only where y == "A", the second where y == "B", and so on. So the end result will be 3 new data frames df.A, df.B, and df.C. I realize that this would be easy to do by hand for a few levels, but my actual data has a lot of levels of y, so using a for loop (or similar) would be nice.
Thanks!
If you want to create separate objects in a loop, you can use assign. I used unique because you said you had many levels.
for (i in unique(df$y)) {
  nam <- paste("df", i, sep = ".")
  assign(nam, df[df$y == i, ])
}
> df.A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
> df.B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
I think you just need the split function:
split(df, df$y)
$A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
$B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
$C
x y
18 18 C
19 19 C
20 20 C
It is just a matter of properly subsetting the output of split and storing the results in objects, like dfA <- split(df, df$y)[[1]] and dfB <- split(df, df$y)[[2]], and so on.
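If you do want the pieces as separate named objects in the workspace, a sketch using list2env (the df.A-style naming just follows the question):
pieces <- split(df, df$y)
names(pieces) <- paste("df", names(pieces), sep = ".")
list2env(pieces, envir = .GlobalEnv)  # creates df.A, df.B, df.C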

Subset columns using logical vector

I have a data frame from which I want to drop columns whose NA rate is greater than 70%, or where a single dominant value takes up over 99% of the rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)
Then how can I continue to use this vector to subset dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols <- c(T, F, F, T, F, F, T)
dd[mycols]
dd[, mycols]
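One side note (not from the original answer): if the logical vector picks out a single column, the matrix-style form drops the result to a plain vector unless you add drop = FALSE:
dd[, c(T, F, F, F, F, F, F)]               # drops to a vector
dd[, c(T, F, F, F, F, F, F), drop = FALSE] # stays a data.frame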
There's really no need to write a function when we have colMeans (thanks @MrFlick for the advice to change from colSums()/nrow()); this is shown at the bottom of this answer.
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
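The above covers the NA-rate condition; the dominant-value condition from the question can be handled the same way. A sketch, where the helper name and the 0.99 threshold are just illustrative:
hasDominant <- function(x) max(table(x, useNA = "ifany")) / length(x) > 0.99
keep <- colMeans(is.na(d)) <= 0.7 & !sapply(d, hasDominant)
d[keep]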
Maybe this will help too. The second argument, 2, in apply() means apply this function column-wise on the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17

Convert a matrix with dimnames into a long format data.frame

Hoping there's a simple answer here but I can't find it anywhere.
I have a numeric matrix with row names and column names:
# 1 2 3 4
# a 6 7 8 9
# b 8 7 5 7
# c 8 5 4 1
# d 1 6 3 2
I want to melt the matrix to a long format, with the values in one column and matrix row and column names in one column each. The result could be a data.table or data.frame like this:
# col row value
# 1 a 6
# 1 b 8
# 1 c 8
# 1 d 1
# 2 a 7
# 2 b 7
# 2 c 5
# 2 d 6
...
Any tips appreciated.
Use melt from reshape2:
library(reshape2)
#Fake data
x <- matrix(1:12, ncol = 3)
colnames(x) <- letters[1:3]
rownames(x) <- 1:4
x.m <- melt(x)
x.m
Var1 Var2 value
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
...
The as.table and as.data.frame functions together will do this:
> m <- matrix( sample(1:12), nrow=4 )
> dimnames(m) <- list( One=letters[1:4], Two=LETTERS[1:3] )
> as.data.frame( as.table(m) )
One Two Freq
1 a A 7
2 b A 2
3 c A 1
4 d A 5
5 a B 9
6 b B 6
7 c B 8
8 d B 10
9 a C 11
10 b C 12
11 c C 3
12 d C 4
Assuming 'm' is your matrix...
data.frame(col = rep(colnames(m), each = nrow(m)),
           row = rep(rownames(m), ncol(m)),
           value = as.vector(m))
This executes extremely fast on a large matrix and also shows you a bit about how a matrix is made, how to access things in it, and how to construct your own vectors.
A modification that doesn't require you to know anything about the storage structure, and that easily extends to high-dimensional arrays if you use the dimnames and slice.index functions:
data.frame(row = rownames(m)[as.vector(row(m))],
           col = colnames(m)[as.vector(col(m))],
           value = as.vector(m))
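A sketch of how the same idea extends to a 3-D array via slice.index (the array a here is made up purely for illustration):
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(r = c("r1", "r2"),
                           c = c("c1", "c2", "c3"),
                           s = paste0("s", 1:4)))
data.frame(r = dimnames(a)[[1]][as.vector(slice.index(a, 1))],
           c = dimnames(a)[[2]][as.vector(slice.index(a, 2))],
           s = dimnames(a)[[3]][as.vector(slice.index(a, 3))],
           value = as.vector(a))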
