How can R report the actual value of i when it is used to name columns and lists in a for loop?
For example, using the following data:
z <- data.frame(x= c(1,2,3,4,5), y = c("a", "b", "v", "d", "e"))
When I reference i inside the loop to create the columns, every column is literally named i:
a_final <- NULL
for(i in z$x){
print(data.frame(i = z$y))
}
Instead, I'd like each column to be named by the value of i in that iteration.
I'd like the results to look something like:
1 2 3 4 5 6
a a a a a a
b b b b b b
c c c c c c
d d d d d d
e e e e e e
You could create a matrix from z$y with nrow(z) rows and nrow(z) columns, and convert it into a data frame:
as.data.frame(matrix(z$y, ncol = nrow(z), nrow = nrow(z)))
# V1 V2 V3 V4 V5
#1 a a a a a
#2 b b b b b
#3 c c c c c
#4 d d d d d
#5 e e e e e
We can also use replicate:
as.data.frame(replicate(nrow(z), z$y))
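Neither snippet above sets the column names from z$x. In the original loop, data.frame(i = z$y) treats i as a literal argument name, so the column is always called "i". A minimal sketch of one way around this, assuming the goal is one column per value of z$x (the name res is only illustrative):

```r
z <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c("a", "b", "v", "d", "e"),
                stringsAsFactors = FALSE)

# Inside a loop, set the name from the value of i instead of the symbol i
for (i in z$x) {
  print(setNames(data.frame(z$y), i))
}

# Without a loop: build all columns first, then assign the names in one step
res <- as.data.frame(replicate(nrow(z), z$y), stringsAsFactors = FALSE)
names(res) <- z$x   # numeric values are coerced to character names
res
```

Because the names are not syntactically valid R names, individual columns have to be accessed with res[["3"]] or res$`3`.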
I have these two data frames:
a <- c('A','B','C','D','E','F','G','H')
b <- c(1,2,1,3,1,3,1,6)
c <- c('K','K','H','H','K','K','H','H')
frame1 <- data.frame(a,b,c)
a <- c('A','A','B','B','C','C','D','D','E','E','F','F','G','H','H')
d <- c(5,5,6,3,1,9,1,0,2,3,6,5,5,5,4)
e <- c('W','W','D','D','D','D','W','W','D','D','W','W','D','W','W')
frame2 <- data.frame(a, d, e)
And now I want to include the column 'e' from 'frame2' into 'frame1' depending on the matching value in column 'a' of both data frames. Note: 'e' is the same for all rows with the same value in 'a'.
The result should look like this:
a b c e
1 A 1 K W
2 B 2 K D
3 C 1 H D
4 D 3 H W
5 E 1 K D
6 F 3 K W
7 G 1 H D
8 H 6 H W
Any suggestions?
You can use match to look up the matching values in column 'a' of both data frames:
frame1$e <- frame2$e[match(frame1$a, frame2$a)]
frame1
# a b c e
#1 A 1 K W
#2 B 2 K D
#3 C 1 H D
#4 D 3 H W
#5 E 1 K D
#6 F 3 K W
#7 G 1 H D
#8 H 6 H W
or using merge:
merge(frame1, frame2[!duplicated(frame2$a), c("a", "e")], all.x=TRUE)
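Both base approaches rely on 'e' being constant within each value of 'a': match returns the position of the first match only, and !duplicated keeps the first row per key. A small sketch on reduced, made-up data (not the frames from the question) showing that the two give the same column:

```r
frame1 <- data.frame(a = c("A", "B", "C"), b = c(1, 2, 1),
                     stringsAsFactors = FALSE)
frame2 <- data.frame(a = c("A", "A", "B", "C", "C"),
                     e = c("W", "W", "D", "D", "D"),
                     stringsAsFactors = FALSE)

# Position of the first row in frame2 whose key equals each key in frame1
idx <- match(frame1$a, frame2$a)
by_match <- frame2$e[idx]

# Same lookup as a left join against the de-duplicated lookup table
lookup <- frame2[!duplicated(frame2$a), c("a", "e")]
merged <- merge(frame1, lookup, by = "a", all.x = TRUE)
```

If a key in frame1 had no partner in frame2, match would return NA for it and merge with all.x = TRUE would fill 'e' with NA as well.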
You can join the two data frames on column 'a' and keep only the matched values. Do a left join from frame1, then drop the columns from the second data frame that aren't needed.
Using dplyr :
library(dplyr)
frame2 %>%
distinct(a, e, .keep_all = TRUE) %>%
right_join(frame1, by = 'a') %>%
select(-d) %>%
arrange(a)
# a e b c
#1 A W 1 K
#2 B D 2 K
#3 C D 1 H
#4 D W 3 H
#5 E D 1 K
#6 F W 3 K
#7 G D 1 H
#8 H W 6 H
I have a dataframe with multiple factors and multiple numeric vars. I would like to collapse one of the factors (say by mean).
In my attempts I could only think of nested sapply or for loops to isolate the numerical elements to be averaged.
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
> var
A B C D E
1 a c e 1.1601720731 -0.57092435
2 a c f -0.0120178626 1.05003748
3 a d e 0.5311032778 1.67867806
4 a d f -0.3399901000 0.01459940
5 a c e -0.2887561691 -0.03847519
6 a c f 0.0004299922 -0.36695879
7 a d e 0.8124655890 0.05444033
8 a d f -0.3777058654 1.34074427
9 b c e 0.7380720821 0.37708543
10 b c f -0.3163496271 0.10921373
11 b d e -0.5543252191 0.35020193
12 b d f -0.5753686426 0.54642790
13 b c e -1.9973216646 0.63597405
14 b c f -0.3728926714 -3.07669300
15 b d e -0.6461596329 -0.61659041
16 b d f -1.7902722068 -1.06761729
sapply(4:ncol(var), function(i){
sapply(1:length(levels(var$A)), function(j){
sapply(1:length(levels(var$B)), function(t){
sapply(1:length(levels(var$C)), function(z){
mean(var[var$A == levels(var$A)[j] &
var$B == levels(var$B)[t] &
var$C == levels(var$C)[z],i])
})
})
})
})
[,1] [,2]
[1,] 0.435707952 -0.3046998
[2,] -0.005793935 0.3415393
[3,] 0.671784433 0.8665592
[4,] -0.358847983 0.6776718
[5,] -0.629624791 0.5065297
[6,] -0.344621149 -1.4837396
[7,] -0.600242426 -0.1331942
[8,] -1.182820425 -0.2605947
Is there a way to do this without so many nested sapply calls, maybe with mapply or outer?
Maybe just:
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
library(dplyr)
var %>%
group_by(A,B,C) %>%
summarise_if(is.numeric,mean)
(Note that the output you show isn't what I get when I run your sapply code, but the above is identical to what I get when I run your sapply's.)
For inline aggregation (keeping same number of rows of data frame), consider ave:
var$D_mean <- with(var, ave(D, A, B, C, FUN=mean))
var$E_mean <- with(var, ave(E, A, B, C, FUN=mean))
For full aggregation (collapsed to factor groups), consider aggregate:
aggregate(. ~ A + B + C, var, mean)
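To make the difference between the two concrete, here is a small sketch on made-up, deterministic data (two grouping columns instead of three, purely for brevity): ave returns one value per original row, while aggregate returns one row per group.

```r
var <- data.frame(A = c("a", "a", "b", "b"),
                  B = c("c", "c", "d", "d"),
                  D = c(1, 3, 10, 20),
                  stringsAsFactors = FALSE)

# Inline: the group mean is repeated on every row of its group
var$D_mean <- with(var, ave(D, A, B, FUN = mean))

# Collapsed: one row per observed (A, B) combination
agg <- aggregate(D ~ A + B, data = var, FUN = mean)

nrow(var)  # still 4 rows
nrow(agg)  # 2 rows, one per group
```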
I will complete the holy trinity with a data.table solution. Here .SD is a data.table of all the columns not listed in the by portion. This is a near-dupe of this question (only difference is >1 column being summarized), so click that if you want more solutions.
library(data.table)
setDT(var)
var[, lapply(.SD, mean), by = .(A, B, C)]
# A B C D E
# 1: a c e 0.07465822 0.032976115
# 2: a c f 0.40789460 -0.944631574
# 3: a d e 0.72054938 0.039781185
# 4: a d f -0.12463910 0.003363382
# 5: b c e -1.64343115 0.806838905
# 6: b c f -1.08122890 -0.707975411
# 7: b d e 0.03937829 0.048136471
# 8: b d f -0.43447899 0.028266455
I have a data frame df with three categorical variables (cat1, cat2, cat3) and two continuous variables (con1, con2). I would like to apply a list of functions (sd, mean) to a list of columns (con1, con2) for different combinations of the grouping columns cat1, cat2, cat3. So far I have done this by explicitly subsetting every combination.
# Random generation of values for categorical data
set.seed(33)
df <- data.frame(cat1 = sample( LETTERS[1:2], 100, replace=TRUE ),
cat2 = sample( LETTERS[3:5], 100, replace=TRUE ),
cat3 = sample( LETTERS[2:4], 100, replace=TRUE ),
con1 = runif(100,0,100),
con2 = runif(100,23,45))
# Introducing missing (NA) values
df$con1[c(23,53,92)] <- NA
df$con2[c(33,46)] <- NA
results <- data.frame()
funs <- list(sd=sd, mean=mean)
# calculation of mean and sd on total observations
sapply(funs, function(x) sapply(df[,c(4,5)], x, na.rm=T))
# calculation of mean and sd on different levels of cat1
sapply(funs, function(x) sapply(df[df$cat1=='A',c(4,5)], x, na.rm=T))
sapply(funs, function(x) sapply(df[df$cat1=='B',c(4,5)], x, na.rm=T))
# calculation of mean and sd on different levels of cat1 and cat2
sapply(funs, function(x) sapply(df[df$cat1=='A' & df$cat2=='C' ,c(4,5)], x, na.rm=T))
.
.
.
sapply(funs, function(x) sapply(df[df$cat1=='B' & df$cat2=='E' ,c(4,5)], x, na.rm=T))
# Similarly for the combinations of three cat variables cat1, cat2, cat3
I would like to write a function that dynamically applies the list of functions to the list of columns for each combination of grouping variables. Could you please give some suggestions? Thanks!
Edit:
I have already received some smart suggestions using dplyr. It would be great if someone could also suggest a solution using the apply family of functions, as that will help me use the resulting data frames in further requirements.
This is a simple one-line base solution:
> do.call(cbind, lapply(funs, function(x) aggregate(cbind(con1, con2) ~ cat1 + cat2 + cat3, data = df, FUN = x, na.rm = TRUE)))
sd.cat1 sd.cat2 sd.cat3 sd.con1 sd.con2 mean.cat1 mean.cat2 mean.cat3 mean.con1 mean.con2
1 A C B NA NA A C B 25.52641 37.40603
2 B C B 32.67192 6.966547 B C B 46.70387 34.85437
3 A D B 31.05224 6.530313 A D B 37.91553 37.13142
4 B D B 23.80335 6.001468 B D B 59.75107 30.29681
5 A E B 22.79285 1.526472 A E B 38.54742 25.23007
6 B E B 32.92139 2.621067 B E B 51.56253 29.52367
7 A C C 26.98661 5.710335 A C C 36.32045 36.42465
8 B C C 20.22217 8.117184 B C C 60.60036 34.98460
9 A D C 33.39273 7.367412 A D C 40.77786 35.03747
10 B D C 12.95351 8.829061 B D C 49.77160 33.21836
11 A E C 33.73433 4.689548 A E C 55.53135 32.38279
12 B E C 25.38637 9.172137 B E C 46.69063 31.56733
13 A C D 36.12545 6.323929 A C D 48.34187 32.36789
14 B C D 30.01992 7.130869 B C D 53.87571 33.12760
15 A D D 15.94151 11.756115 A D D 35.89909 31.76871
16 B D D 10.89030 6.829829 B D D 22.86577 32.53725
17 A E D 24.88410 6.108631 A E D 47.32549 35.22782
18 B E D 12.73711 8.151424 B E D 33.95569 36.70167
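The question also asks for every combination of grouping columns, not only all three at once. A rough sketch of one way to do that in base R, assuming a hypothetical helper summarise_by (not from any package) and a small made-up data frame. Unlike the formula interface above, which drops any row with an NA in either value column, this version passes NAs through to each function and lets na.rm handle them per column.

```r
# Hypothetical helper: apply each function in funs to value_cols,
# grouped by group_cols; returns one data frame per function.
summarise_by <- function(data, group_cols, value_cols, funs) {
  lapply(funs, function(f) {
    aggregate(data[value_cols], by = data[group_cols], FUN = f, na.rm = TRUE)
  })
}

df <- data.frame(cat1 = c("A", "A", "B", "B"),
                 cat2 = c("C", "C", "C", "D"),
                 con1 = c(1, 3, NA, 5),
                 con2 = c(2, 4, 6, 8),
                 stringsAsFactors = FALSE)
funs <- list(sd = sd, mean = mean)
cats <- c("cat1", "cat2")
vals <- c("con1", "con2")

# One grouping
res <- summarise_by(df, "cat1", vals, funs)

# Every non-empty combination of the grouping columns
combos <- unlist(lapply(seq_along(cats),
                        function(k) combn(cats, k, simplify = FALSE)),
                 recursive = FALSE)
all_res <- lapply(combos, function(g) summarise_by(df, g, vals, funs))
names(all_res) <- vapply(combos, paste, character(1), collapse = "+")
```

Groups that contain only NA for a column come back as NaN for mean and NA for sd, so the result may need filtering afterwards.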
I have this R data frame:
v1 <- LETTERS[1:10]
v2 <- LETTERS[1:4]
v3 <- LETTERS[4:5]
dat <- data.frame(cbind(v1,v2,v3))
v1 v2 v3
A A D
B B E
C C D
D D E
E A D
F B E
G C D
H D E
I A D
J B E
I would like to count the number of occurrences of a given value (e.g. "A") in each row, across columns v1 through v3, and save that count as a new column (CountA) in my data frame.
My desired output would be:
v1 v2 v3 CountA
A A D 2
B B E 0
C C D 0
D D E 0
E A D 1
F B E 0
G C D 0
H D E 0
I A D 1
J B E 0
Try this:
dat$CountA <- rowSums(dat=="A")
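This works because dat == "A" yields a logical matrix with the same shape as dat, and rowSums counts the TRUE values in each row. A short sketch on a three-row, made-up subset:

```r
dat <- data.frame(v1 = c("A", "B", "E"),
                  v2 = c("A", "B", "A"),
                  v3 = c("D", "E", "D"),
                  stringsAsFactors = FALSE)

is_a <- dat == "A"            # logical matrix, TRUE where the cell equals "A"
dat$CountA <- rowSums(is_a)   # TRUE counts as 1, FALSE as 0
dat
```

If the columns can contain NA, rowSums(dat == "A", na.rm = TRUE) avoids NA counts.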
Is it possible to count the unique elements in each row of a data frame, return the one with the most occurrences, and collect the results in a vector?
Example:
a a a b b b b -> b
c v f w w r t -> w
s s d f b b b -> b
You can use apply to run the table function on every row of the data frame.
df <- read.table(textConnection("a a a b b b b\nc v f w w r t\ns s d f b b b"), header = F)
df$result <- apply(df, 1, function(x) names(table(x))[which.max(table(x))])
df
## V1 V2 V3 V4 V5 V6 V7 result
## 1 a a a b b b b b
## 2 c v f w w r t w
## 3 s s d f b b b b
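One detail worth knowing about this approach: table orders its entries by sorted value, and which.max returns the first maximum, so a tie is resolved in favour of the alphabetically first value. A minimal sketch, with row_mode as an illustrative helper name:

```r
# Most frequent value in a vector; ties go to the first name in table order
row_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

row_mode(c("a", "a", "a", "b", "b", "b", "b"))  # clear winner: "b"
row_mode(c("b", "b", "a", "a"))                 # tie: "a" is returned
```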
Yes, with table:
x=c("a", "a", "a", "b" ,"b" ,"b" ,"b")
table(x)
x
a b
3 4
EDIT: with data.table
library(data.table)
DT = data.table(x=sample(letters[1:5],10,T),y=sample(letters[1:5],10,T))
#DT
# x y
# 1: d a
# 2: c d
# 3: d c
# 4: c a
# 5: a e
# 6: d c
# 7: c b
# 8: a b
# 9: b c
#10: c d
f = function(x) names(table(x))[which.max(table(x))]
DT[,lapply(.SD,f)]
# x y
#1: c c
Note that if you want to keep all of the maxima in case of ties, you need to ask for them explicitly.
You can save them as a list inside the data.frame. If there is only one per row, the list will be simplified to an ordinary vector.
df$result <- apply(df, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
With Ties for max:
df2 <- df[, 1:6]
df2$result <- apply(df2, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
> df2
V1 V2 V3 V4 V5 V6 result
1 a a a b b b 3, 3
2 c v f w w r 2
3 s s d f b b 2, 2
With No Ties for max:
df$result <- apply(df, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
> df
V1 V2 V3 V4 V5 V6 V7 result
1 a a a b b b b 4
2 c v f w w r t 2
3 s s d f b b b 3
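The result column above prints the tied counts. If the tied values themselves are wanted, the names of the table can be returned instead. A sketch under that assumption, with all_modes as an illustrative helper name and the same three rows truncated to six columns:

```r
# All values that share the maximum count in a vector
all_modes <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts == max(counts))]
}

df2 <- read.table(text = "a a a b b b\nc v f w w r\ns s d f b b",
                  header = FALSE, colClasses = "character")

# lapply over row indices always returns a list, even when no row has ties
modes <- lapply(seq_len(nrow(df2)), function(i) all_modes(unlist(df2[i, ])))
df2$result <- modes
```

Using lapply rather than apply here is a deliberate choice: apply would simplify to a plain vector or matrix whenever every row happens to have the same number of modes.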