Store R loop result and combine it with new result

I'm pretty new to loops in R, so sorry if this question is too simple. I'm trying to write a loop to subset data. The code is:
a <- sample(rep(1:5, 10), 10)
b <- sample(rep(1:5, 10), 10)
c <- data.frame(a, b)
s <- c(1,2)
for (i in s){
x <- data.frame()
x <- rbind(x, c[which(a==i),])
}
x only contains the result for a = 2. But when I removed x and used print() instead, it printed the data frames for both a = 1 and a = 2. I don't know what's wrong with the loop. Thanks!

You can avoid the for loop and subset the rows by matching the values of a1 against s1.
set.seed(1L)
a1 <- sample(rep(1:5, 10), 10)
b1 <- sample(rep(1:5, 10), 10)
c1 <- data.frame(a1, b1)
s1 <- c(1,2)
a1 %in% s1
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
c1[ a1 %in% s1, ]
# a1 b1
# 6 1 3
# 7 2 2
# 9 2 1

Good comments and an answer have already been posted. Still, I want to clarify a few points that may help the OP.
For loops are often not very R-like, as loops are not very efficient in many cases. The problem in your loop is that x <- data.frame() sits inside it, so x is reset to an empty data frame on every iteration and only the last result survives. If you still want to fix the loop, modify it as follows:
# Calling seed will ensure same output from function like sample. This will
# generate consistent result in every attempt
set.seed(1)
a <- sample(rep(1:5, 10), 10)
b <- sample(rep(1:5, 10), 10)
c <- data.frame(a, b) # good to name it df
s <- c(1,2)
# Fix for-loop
x <- data.frame() #assign x out of the for-loop
for (i in s){
x <- rbind(x, c[which(a==i),])
}
#Result
> x
# a b
#6 1 3
#7 2 2
#9 2 1
# R-like approach
> c[c$a %in% s,] #use the column of 'c' dataframe directly in condition
# a b
#6 1 3
#7 2 2
#9 2 1
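If the loop form is still wanted, one possible variation is to collect the pieces in a list and bind them once at the end, instead of growing x on every iteration. This is only a sketch of that idea; the name dat (used here instead of c, which masks the base function) and the name pieces are choices made for this illustration.

```r
set.seed(1)
a <- sample(rep(1:5, 10), 10)
b <- sample(rep(1:5, 10), 10)
dat <- data.frame(a, b)   # 'dat' instead of 'c', to avoid masking base::c
s <- c(1, 2)

# Build one subset per value of s and keep them in a list.
pieces <- lapply(s, function(i) dat[dat$a == i, ])

# Bind all subsets together in a single step.
x <- do.call(rbind, pieces)
x
```

The result should contain the same rows as dat[dat$a %in% s, ], grouped in the order of the values in s.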

Related

iterating table() results into matrix/data frame

This must be simple, but I've been banging my head against it for a while. Please help. I have a large data set from which I get all kinds of information via table(). I then want to store these counts together with the row names that were counted. For a reproducible example, consider:
a <- c("a", "b", "c", "d", "a", "b") # one count, occurring twice for a and
# b and once for c and d
b <- c("a", "c") # a completely different property from the dataset
# occurring once for a and c
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
How can I merge/bind/whatever to get from x and y to this form:
  x y
a 2 1
b 2 0
c 1 1
d 1 0
HOWEVER, I need to use the solution to work iteratively, in a loop that takes x and y and gets the requested form above, and then gets more tables added, each hopefully adding a column. One of my many failed attempts, just to show my (probably flawed) logic, is:
member <- function (data = dfm, groupvar = 'group', analysis = kc15) {
res<-matrix(NA,ncol=length(analysis$size)+1) #preparing an object for the results
res[,1]<-table(docvars(data,groupvar)) #getting names and totals of groups
for (i in 1:length(analysis$size)) { #getting a bunch of counts that I care about
r<-table(docvars(data,groupvar)[analysis$cluster==i])
res<-cbind(res,r) #here's the problem, trying to add each new count as a column.
}
res
}
So, to sum up: the reproducible example above is meant to replicate the first column in res and one r. I'm looking for (I think) a correct replacement for cbind that allows adding columns of different lengths but with matching names, as in the example above.
Please help; it's embarrassing how much time I'm wasting on this.
The following may be an option, which merges on the "row names" of the data frames, converted from the frequency tables:
df <- merge(as.data.frame(x, row.names=1, responseName ="x"),
as.data.frame(y, row.names=1, responseName ="y"),
by="row.names", all=TRUE)
df[is.na(df)] <- 0; df
Row.names x y
1 a 2 1
2 b 2 0
3 c 1 1
4 d 1 0
Then, this method can be incorporated into your real data with some modification. I've made up the data since I didn't have any to work with.
set.seed(1234)
groupvar <- sample(letters[1:4], 16, TRUE)
clusters <- 1:4
cluster <- rep(clusters, each=4)
Merge the first two tables:
res <- merge(as.data.frame(table(groupvar[cluster==1]),
row.names=1, responseName=clusters[1]),
as.data.frame(table(groupvar[cluster==2]),
row.names=1, responseName=clusters[2]),
by="row.names", all=TRUE)
Then merge the others using your for loop.
for (i in 3:length(clusters)) {
r <- table(groupvar[cluster==i])
res <- merge(res, as.data.frame(r, row.names=1, responseName = clusters[i]),
by.x="Row.names", by.y="row.names", all=TRUE)
}
res[is.na(res)] <- 0
res
Row.names X1 X2 X3 X4
1 a 1 2 0 0
2 b 1 1 2 2
3 c 0 1 1 2
4 d 2 0 1 0
Alternatively, merge the transposed tables and transpose the result back:
res <- t(merge(t(unclass(x)), t(unclass(y)), all=TRUE))
res <- `colnames<-`(res[order(rownames(res)), 2:1], c("x", "y"))
res[is.na(res)] <- 0
res
# x y
# a 2 1
# b 2 0
# c 1 1
# d 1 0
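A different technique from the merge-based answers above, shown here only as a sketch: if all categories can be collected up front, each vector can be converted to a factor with one shared set of levels, so that every table has the same rows and missing categories are counted as 0. The names props, lv and z are made up for this illustration.

```r
a <- c("a", "b", "c", "d", "a", "b")
b <- c("a", "c")
props <- list(x = a, y = b)   # hypothetical container for the properties

# Shared levels: every category that occurs in any of the vectors.
lv <- sort(unique(unlist(props)))

# Count each vector against the shared levels; absent categories give 0.
res <- vapply(props,
              function(v) as.vector(table(factor(v, levels = lv))),
              integer(length(lv)))
rownames(res) <- lv
res

# Further counts can be added as new columns inside a loop in the same way.
z <- c("d", "d", "b")
res <- cbind(res, z = as.vector(table(factor(z, levels = lv))))
res
```

This relies on lv covering every category. If new categories can show up in later iterations, the merge approach with all=TRUE is the safer choice.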

lapply to add column to existing dataframes

I have a list of data frames, and want to perform a function on each column in the data frame.
I've been googling for a while, but the issue I have is this:
df.1 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
df.2 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
names(df.1) <- c("a", "b", "c")
names(df.2) <- c("a", "b", "c")
ls.1<- list(df.1,df.2)
res <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
This returns a new list, res, containing unnamed data frames (res[[1]], res[[2]], etc.):
[[1]]
a b c d
1 2.2378686 3.640607 4.793172 8.433780
2 -0.4411046 3.690850 5.290814 8.981664
3 -1.1490879 3.081092 4.982820 8.063912
4 -0.3024211 1.929033 4.743569 6.672602
5 1.3658726 3.395564 2.800131 6.195695
[[2]]
a b c d
1 0.3452530 3.264709 7.384127 10.648836
2 -1.2031949 3.118633 4.840496 7.959129
3 0.6177369 1.119107 4.938917 6.058024
4 -1.0470713 1.942357 5.747748 7.690106
5 0.8732836 2.704501 5.805754 8.510254
I'm interested in adding columns to the original data frames (df.1, df.2). How would I do this?
You can name your list elements, or use tibble::lst which will do it for you:
ls.1<- list(df.1 = df.1,df.2 = df.2)
ls.2<- tibble::lst(df.1, df.2)
res1 <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
res2 <- lapply(ls.2, function(x){
x$d <- x$b + x$c
return(x)
})
# $df.1
# a b c d
# 1 0.6782608 4.0774244 2.845351 6.922776
# 2 2.3620601 1.9395314 5.438832 7.378364
# 3 -0.5913838 2.0579972 4.312360 6.370357
# 4 0.5532147 0.8581389 5.867889 6.726027
# 5 -0.3251044 1.9838598 4.321008 6.304867
#
# $df.2
# a b c d
# 1 1.9918131 3.195105 5.715858 8.910963
# 2 0.2525537 2.507358 5.040691 7.548050
# 3 0.5038298 3.112855 5.265974 8.378830
# 4 0.4873384 3.377182 5.685714 9.062896
# 5 -0.6539881 0.157948 5.407508 5.565456
To overwrite the original data.frames you can use list2env on the output.
In order to add columns, you will have to either overwrite your ls.1 with res or perhaps manually assign result to your original data.frames, e.g. df.1 <- res[[1]]. But there are a hundred ways to skin a cat (pun intended) and there may be other better approaches.
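A minimal sketch of the list2env route mentioned above, assuming the list elements are named after the original data frames. The small integer data frames are made up so that the result is easy to check.

```r
df.1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df.2 <- data.frame(a = 3:1, b = 6:4, c = 9:7)

# The list must be named, because list2env uses the names as variable names.
ls.1 <- list(df.1 = df.1, df.2 = df.2)

res <- lapply(ls.1, function(x) {
  x$d <- x$b + x$c
  x
})

# Copy every element of res into the current environment, replacing df.1 and
# df.2. At top level, environment() is the global environment.
invisible(list2env(res, envir = environment()))

names(df.1)
```

After this call, df.1 and df.2 themselves carry the new column d, so res is no longer needed.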

Select rows in a data.frame when some rows repeat

I have the following toy dataset
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
value = sample(LETTERS, 15, replace = TRUE),
weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df
ID value weight
1 1 I 0.1
2 1 G 0.1
3 1 O 0.1
4 2 B 0.1
5 2 M 0.1
6 2 M 0.1
7 3 V 0.5
8 3 J 0.5
9 3 O 0.5
10 4 E 0.2
11 4 Q 0.2
12 4 W 0.2
13 5 H 0.1
14 5 K 0.1
15 5 T 0.1
where each ID is an individual respondent, answering 3 questions (in the actual dataset, the number of questions answered is variable, so I can't rely on a certain number of rows per ID).
I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight.
probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for(i in 1:nrow(probs)){
probs$prob[i] <- df[df$ID %in% probs$ID[i],]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs,10)
[1] 4 3 3 3 4 4 2 4 2 3
Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame is stumping me. I've tried
dfW <- df[df$ID %in% sampledIDs,]
but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:
dfW <- df[df$ID == sampledIDs[1],]
for(i in 2:length(sampledIDs)){
dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
}
but that's painfully slow with a large dataset.
Any help would be very appreciated.
(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)
The code is slow because the data frame is resized in every iteration of the for loop. Here is my suggestion: before the loop, create dfW with the final size it will have, then assign the rows from df into dfW inside the loop. This assumes every ID has exactly 3 rows, as in the toy data. You can replace the last part of your code with this:
dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3))
colnames(dfW) <- colnames(df) # make the column names the same
for(i in 1:length(sampledIDs)){ # notice the start index is changed from 2 to 1
#dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}
Your code should run much faster with this change. Let me know how it goes!
If you don't know the final size, you can resize the data frame whenever needed, but an extra if condition has to be added to the for loop. First define a function that doubles the number of rows:
double_rowsize <- function(df) {
mdf <- as.data.frame(matrix(, nrow = nrow(df), ncol = ncol(df)))
colnames(mdf) <- colnames(df)
df <- rbind(df, mdf)
return(df)
}
Then start the dfW with an initial size like 12 (3 times 4):
dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)
And finally add an if condition in the for loop to resize the dataframe whenever needed:
for(i in 1:length(sampledIDs)){
if (3*i > nrow(dfW))
dfW <- double_rowsize(dfW)
dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}
You can change the details of double_rowsize to grow the data frame by a factor other than 2 if that works better. Doubling is a common choice for resizable arrays because it keeps the number of reallocations small (logarithmic in the final size).
Good luck!
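As an alternative to filling dfW inside a loop, the rows can be looked up by index in one step. The sketch below is not the method from the answer above; it uses split() to group row numbers by ID, which also covers the case where the number of rows per ID varies. The name rows_by_id is made up for this illustration, and the sample size is reduced to 1000 to keep it quick.

```r
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))

# One row per ID. sample() accepts weights that do not sum to one,
# so no manual normalisation is needed.
probs <- df[!duplicated(df$ID), c("ID", "weight")]
sampledIDs <- sample(probs$ID, size = 1000, replace = TRUE, prob = probs$weight)

# Row numbers of df grouped by ID (any number of rows per ID).
rows_by_id <- split(seq_len(nrow(df)), df$ID)

# Look up the row numbers for every sampled ID and index df once.
row_idx <- unlist(rows_by_id[as.character(sampledIDs)], use.names = FALSE)
dfW <- df[row_idx, ]
head(dfW)
```

Because df is indexed only once, no data frame is grown or resized during the process.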

How to combine a data frame and a vector

df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c(1,2)
How do I combine df and y into a new data frame that has all combinations of rows from df with elements from y? In this example, the output should be
data.frame(w=c("r","r","q","q"), x=c("a","a","b","b"),y=c(1,2,1,2))
w x y
1 r a 1
2 r a 2
3 q b 1
4 q b 2
This should do what you're trying to do, and without too much work.
dl <- unclass(df)
dl$y <- y
merge(df, expand.grid(dl))
# w x y
# 1 q b 1
# 2 q b 2
# 3 r a 1
# 4 r a 2
data.frame(lapply(df, rep, each = length(y)), y = y)
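This should work for the example. Note that it is built on permutations of y (permn), so it assumes y has as many elements as df has rows, and with more than two elements each pair appears more than once: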
library(combinat)
df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c("one", "two") #for generality
indices <- permn(seq_along(y))
combined <- NULL
for(i in indices){
current <- cbind(df, y=y[unlist(i)])
if(is.null(combined)){
combined <- current
} else {
combined <- rbind(combined, current)
}
}
print(combined)
Here is the output:
w x y
1 r a one
2 q b two
3 r a two
4 q b one
... or to make it shorter (and less obvious):
combined <- do.call(rbind, lapply(indices, function(i){cbind(df, y=y[unlist(i)])}))
First, convert class of columns from factor to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
Then, use expand.grid to get a index matrix for all combinations of rows of df and elements of y:
ind.mat = expand.grid(1:length(y), 1:nrow(df))
Finally, loop through the rows of ind.mat to get the result:
data.frame(t(apply(ind.mat, 1, function(x){c(as.character(df[x[2], ]), y[x[1]])})))
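Another way to get the same result, sketched here with plain indexing rather than apply: repeat each row of df once per element of y, then attach y repeated once per row of df. The names row_idx and combined are made up for this illustration.

```r
df <- data.frame(w = c("r", "q"), x = c("a", "b"))
y <- c(1, 2)

# Each row number of df, repeated once for every element of y.
row_idx <- rep(seq_len(nrow(df)), each = length(y))

# Attach y, cycled once per row of df, and reset the row names.
combined <- cbind(df[row_idx, , drop = FALSE], y = rep(y, times = nrow(df)))
rownames(combined) <- NULL
combined
```

This keeps the original column types and does not require df and y to have the same length.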

How to find which elements of one set are in another set?

I have two sets: A with columns x,y, and B also with columns x, y.
I need to find the index of the rows of A which are inside of B (both x and y must match).
I have come up with a simple solution (see below), but this comparison runs inside a loop, and paste adds a lot of extra time.
B <- data.frame(x = sample(1:1000, 1000), y = sample(1:1000, 1000))
A <- B[sample(1:1000, 10),]
#change some elements
A$x[c(1,3,7,10)] <- A$x[c(1,3,7,10)] + 0.5
A$xy <- paste(A$x, A$y, sep='ZZZ')
B$xy <- paste(B$x, B$y, sep='ZZZ')
indx <- which(A$xy %in% B$xy)
indx
For example, for a single observation, an alternative to paste is almost 3 times faster:
ind <- sample(1:1000, 1)
xx <- B$x[ind]
yy <- B$y[ind]
ind <- which(with(B, x==xx & y==yy))
# [1] 0.0160000324249268 seconds
xy <- paste(xx,'ZZZ',yy, sep='')
ind <- which(B$xy == xy)
# [1] 0.0469999313354492 seconds
How about using merge() to do the matching for you?
A$id <- seq_len(nrow(A))
sort(merge(A, B)$id)
# [1] 2 4 5 6 8 9
Edit:
Or, to get rid of two unnecessary sorts, use the sort= option to merge()
merge(A, B, sort=FALSE)$id
# [1] 2 4 5 6 8 9
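To see that the merge approach returns the same indices as the paste approach, here is a small deterministic check. The toy values are made up for this illustration, and by = c("x", "y") is spelled out so the match does not depend on any helper columns that may have been added.

```r
B <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
A <- data.frame(x = c(2, 3.5, 4), y = c(20, 30, 40))  # row 2 has no match in B

# merge approach: tag the rows of A, then keep the tags that survive the join.
A$id <- seq_len(nrow(A))
idx_merge <- sort(merge(A, B, by = c("x", "y"))$id)

# paste approach from the question, for comparison.
idx_paste <- which(paste(A$x, A$y, sep = "ZZZ") %in%
                   paste(B$x, B$y, sep = "ZZZ"))

idx_merge
idx_paste
```

Both vectors should list the rows of A whose (x, y) pair also occurs in B.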

Resources