How to save all loop's results in a csv - r
Consider this toy data frame:
df <- read.table(text = "target birds wolfs
0 21 7
0 8 4
1 2 5
1 2 4
0 8 3
1 1 12
1 7 10
1 1 9 ",header = TRUE)
I would like to run a loop that will calculate the mean of rows 2:5 for each variable and save all the results to a CSV file.
I wrote this line of code:
for(i in names(df)) {print(mean(df[2:5,i]))}
and got the following results:
[1] 0.5
[1] 5
[1] 4
But when I tried to export it to CSV using the code below, the file contained only the last result: [1] 4.
code:
for(i in names(df)) {
  j <- mean(df[2:5, i])
  write.csv(j, "j.csv")
}
How can I get in the same csv file a list of all the results?
In dplyr, you could use summarise_each to perform a computation on every column of your data frame.
library(dplyr)
j <- slice(df,2:5) %>% # selects rows 2 to 5
summarise_each(funs=funs(mean(.))) # computes mean of each column
The results are in a data.frame:
j
target birds wolfs
1 0.5 5 4
If you want each variable's mean on a separate line, use t():
t(j)
[,1]
target 0.5
birds 5.0
wolfs 4.0
And to export the results:
write.csv(t(j),"j.csv")
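Alternatively, staying with base R and the original loop idea, you could collect all the means first and call write.csv only once. A minimal sketch (using the same df as above):
# compute the mean of rows 2:5 for every column, then write the file once
means <- sapply(names(df), function(i) mean(df[2:5, i]))
write.csv(data.frame(variable = names(means), mean = means), "j.csv", row.names = FALSE)
This avoids overwriting j.csv on every iteration, which is why the original loop kept only the last result.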
Related
R rearrange data
I have a bunch of texts written by the same person, and I'm trying to estimate the templates they use for each text. The way I'm going about this is: create a TermDocumentMatrix for all the texts take the raw Euclidean distance of each pair cut out any pair greater than X distance (10 for the sake of argument) flatten the forest return one example of each template with some summarized stats I'm able to get to the point of having the distance pairs, but I am unable to convert the dist instance to something I can work with. There is a reproducible example at the bottom. The data in the dist instance looks like this: The row and column names correspond to indexes in the original list of texts which I can use to do achieve step 5. What I have been trying to get out of this is a sparse matrix with col name, row name, value. col, row, value 1 2 14.966630 1 3 12.449900 1 4 13.490738 1 5 12.688578 1 6 12.369317 2 3 12.449900 2 4 13.564660 2 5 12.922848 2 6 12.529964 3 4 5.385165 3 5 5.830952 3 6 5.830952 4 5 7.416198 4 6 7.937254 5 6 7.615773 From this point I would be comfortable cutting out all pairs greater than my cutoff and flattening the forest, i.e. returning 3 templates in this example, a group containing only document 1, a group containing only document 2 and a third group containing documents 3, 4, 5, and 6. I have tried a bunch of things from creating a matrix out of this and then trying to make it sparse, to directly using the vector inside of the dist class, and I just can't seem to figure it out. Reproducible example: tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,1,1,1
,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6) rownames(tdm) <- 1:6 colnames(tdm) <- paste("term", 1:229, sep="") tdm.dist <- dist(tdm) # I'm stuck turning tdm.dist into what I have shown
A classic approach to turn a "matrix"-like object into a [row, col, value] "data.frame" is the as.data.frame(as.table(.)) route. Specifically here, we need:
subset(as.data.frame(as.table(as.matrix(tdm.dist))), as.numeric(Var1) < as.numeric(Var2))
But that involves way too many coercions and the creation of a larger object only to subset it immediately. Since dist stores its values in "lower.tri"angle form, we could use combn to generate the row/col indices and cbind them with the "dist" object:
data.frame(do.call(rbind, combn(attr(tdm.dist, "Size"), 2, simplify = FALSE)), c(tdm.dist))
Also, the "Matrix" package has some flexibility that, along with its memory efficiency in creating objects, could be used here:
library(Matrix)
tmp = combn(attr(tdm.dist, "Size"), 2)
summary(sparseMatrix(i = tmp[2, ], j = tmp[1, ], x = c(tdm.dist),
                     dims = rep_len(attr(tdm.dist, "Size"), 2), symmetric = TRUE))
Additionally, among the various functions that handle "dist" objects,
cutree(hclust(tdm.dist), h = 10)
#1 2 3 4 5 6
#1 2 3 3 3 3
groups the documents by specifying the cut height.
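As a quick sanity check, the combn route can be tried on a tiny dist object; a sketch (the column names are added here only for readability):
# three points at (0,0), (3,0) and (0,4); pairwise Euclidean distances are 3, 4 and 5
m <- matrix(c(0, 0, 3, 0, 0, 4), nrow = 3, byrow = TRUE)
d <- dist(m)
res <- data.frame(do.call(rbind, combn(attr(d, "Size"), 2, simplify = FALSE)), c(d))
names(res) <- c("row", "col", "value")
res
#   row col value
# 1   1   2     3
# 2   1   3     4
# 3   2   3     5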
That's how I've done a very similar thing in the past using dplyr and tidyr packages. You can run the chained (%>%) script row by row to see how the dataset is updated step by step. tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6) rownames(tdm) <- 1:6 colnames(tdm) <- paste("term", 1:229, sep="") tdm.dist <- dist(tdm) library(dplyr) library(tidyr) tdm.dist %>% as.matrix() %>% # update dist object to a matrix data.frame() %>% # update matrix to a data frame setNames(nm = 1:ncol(.)) %>% # update column names mutate(names1 = 1:nrow(.)) %>% # use rownames as a variable gather(names2, value , -names1) %>% # reshape data filter(names1 <= names2) # keep the values only once # names1 names2 value # 1 1 1 0.000000 # 2 1 2 14.966630 # 3 2 2 0.000000 # 4 1 3 12.449900 # 5 2 3 12.449900 # 6 3 3 0.000000 # 7 1 4 13.490738 # 8 2 4 
13.564660 # 9 3 4 5.385165 # 10 4 4 0.000000 # 11 1 5 12.688578 # 12 2 5 12.922848 # 13 3 5 5.830952 # 14 4 5 7.416198 # 15 5 5 0.000000 # 16 1 6 12.369317 # 17 2 6 12.529964 # 18 3 6 5.830952 # 19 4 6 7.937254 # 20 5 6 7.615773 # 21 6 6 0.000000
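Note that gather() has since been superseded in tidyr; with a current tidyr (>= 1.0.0) the reshape step could be written with pivot_longer() instead. A sketch of the same pipeline:
library(dplyr)
library(tidyr)
tdm.dist %>%
  as.matrix() %>%
  data.frame() %>%
  setNames(nm = 1:ncol(.)) %>%                 # update column names
  mutate(names1 = 1:nrow(.)) %>%               # use row numbers as a variable
  pivot_longer(-names1, names_to = "names2", values_to = "value") %>%  # reshape data
  filter(names1 <= names2)                     # keep each pair only once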
How to remove outliers from multiple columns of a data frame
I would like to get a data frame that contains only data that is within 2 SD per each numeric column. I know how to do it for a single column, but how can I do it for a bunch of columns at once? Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)
Here is the code line for getting only the data that is under 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
  target birds wolfs Country
2      3     8     4       b
3      1     2     8       c
4      1     2     3       a
5      1     8     3       a
6      6     1     2       a
7      6     7     1       b
8      6     1     5       c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT: I was under the impression that we need to remove the outliers for each column separately. But, if we need to keep only the rows that have no outliers in any of the numeric columns, we can loop through the columns with lapply as before; instead of returning 'x', we return the sequence along 'x', and then get the intersection of the list elements with Reduce. The resulting numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
         seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect, lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD. In that case I would suggest creating two filters: the first indicates the numeric columns, the second checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
#   target birds wolfs Country
# 2      3     8     4       b
# 3      1     2     8       c
# 4      1     2     3       a
# 5      1     8     3       a
# 6      6     1     2       a
# 7      6     7     1       b
# 8      6     1     5       c
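If you prefer dplyr, the same row-wise filter can be expressed with if_all(); a sketch, assuming dplyr >= 1.0.0:
library(dplyr)
# keep rows where every numeric column is within 2 SD of its column mean
df %>%
  filter(if_all(where(is.numeric), ~ abs(.x - mean(.x)) / sd(.x) <= 2))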
loop ordinal regression statistical analysis and save the data R
Could you, please, help me with a loop? I am relatively new to R. The short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have code for an ordinal regression analysis for one trial from one participant (eye-tracking data - eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a huge mass of data consisting of 91832 observations of 11 variables. In total there are 41 participants with 78 trials each. In my code I extract data from one trial from each participant to run the analysis. However, it takes a long time to run the analysis manually for all trials for all participants. Could you, please, help me to create a loop that will read in all 78 trials from all 41 participants and save the output of the statistics (I want to save summary(m), ci, and coef(m)) in one file. Thank you in advance!
You could generate a unique identifier for every trial of every participant. Then you could loop over all unique values of this identifier and subset the data accordingly. Then you run the regressions and save the output as an R object.
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
  ss <- eyeData[eyeData$uniqueIdent == un,]
  ss <- ss[!duplicated(ss$wordTar), ] # maybe do this outside the loop
  ss$lFreq <- log(ss$Freq) # you could do this outside the loop too
  # create DV
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF~lFreq*Len, data=ss, link='probit')
  seeSumm <- summary(m)
  ci <- confint(m)
  oddsR <- exp(coef(m))
  # add -un- to the output file name to be able to identify where it came from
  save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
}
Variations of this could include combining the output of every iteration in a list (create an empty list in the beginning) and then, after running the estimations and the post-estimation commands, recursively filling the previously created list "gatherRes":
gatherRes <- vector(mode = "list", length = length(unique(eyeData$uniqueIdent))) ## before the loop
gatherRes[[un]] <- list(seeSumm, ci, oddsR) ## last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
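A sketch of that lapply variant, reusing the uniqueIdent column from above (the clm call and column names are taken from the question; confint() can fail on the odd trial, so wrapping the body in tryCatch may be worthwhile):
library(ordinal)
# fit the model for one trial's data and return the three statistics
fit_one <- function(ss) {
  ss <- ss[!duplicated(ss$wordTar), ]
  ss$lFreq <- log(ss$Freq)
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF ~ lFreq * Len, data = ss, link = 'probit')
  list(summary = summary(m), ci = confint(m), oddsR = exp(coef(m)))
}
# split() gives one data frame per participant/trial combination
gatherRes <- lapply(split(eyeData, eyeData$uniqueIdent), fit_one)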
Here is a solution using the plyr package (it should be faster than a for loop). Since you don't provide a reproducible example, I'll use the iris data as an example. First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
  m = lm(Sepal.Width ~ Sepal.Length, data = x)
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, using your variables of interest as grouping:
data(iris)
library(plyr) # if not installed, do install.packages("plyr")
# Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing the output of summary, confint and coef for each species. For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
  # Remove duplicates
  x = x[!duplicated(x$wordTar), ]
  # Make log frequencies
  x$lFreq <- log(x$Freq)
  # Make ranks
  x$rankF <- as.factor(seq(nrow(x)))
  # Fit model
  m <- clm(rankF~lFreq*Len, data=x, link='probit')
  # Return list of statistics
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)
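With current dplyr, the same grouping idea could be expressed without plyr, for example via group_split(); a sketch (assumes dplyr >= 0.8.0 and the ordFit_stats function defined above):
library(dplyr)
# one list element per participant/trial combination
results <- eyeData %>%
  group_split(sNumber, TrialNo) %>%
  lapply(ordFit_stats)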
R - efficient comparison of subsets of rows between data frames
Thank you for any help. I need to check the total number of matches from the elements of each row of a data frame (df1) on rows of another data frame (df2). The data frames have different numbers of columns (5 in the first one versus 6 in the second one, for instance), and there is no exact formation rule for the rows (so I cannot find a way of doing this through combinatorial analysis). This routine must check all the rows from the first data frame against all the rows of the second data frame, resulting in a total number of occurrences by the number of hits. Not all the possible sums are of interest. Actually I am looking for a specific total (which I call "hits" in this text). In other words: how many times a subset of each row of df2 of size "hits" can be found in rows of df1. Here is an example:
> ### Example
> ### df1 and df2 here are regularly formed just for illustration purposes
>
> require(combinat)
>
> df1 <- as.data.frame(t(combn(6,5)))
> df2 <- as.data.frame(t(combn(7,6)))
>
> df1
  V1 V2 V3 V4 V5
1  1  2  3  4  5
2  1  2  3  4  6
3  1  2  3  5  6
4  1  2  4  5  6
5  1  3  4  5  6
6  2  3  4  5  6
>
> df2
  V1 V2 V3 V4 V5 V6
1  1  2  3  4  5  6
2  1  2  3  4  5  7
3  1  2  3  4  6  7
4  1  2  3  5  6  7
5  1  2  4  5  6  7
6  1  3  4  5  6  7
7  2  3  4  5  6  7
>
In this example, please note, for instance, that subsets of size 5 from row #1 of df2 can be found 6 times in the rows of df1, and so on. I tried something like this:
> ### Check how many times subsets of size "hits" from rows of df2 are found in rows of df1
>
> myfn <- function(dfa,dfb,hits) {
+   sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
+ }
>
> r1 <- myfn(df1,df2,5)
>
> cbind(df2,"hits.eq.5" = r1)
  V1 V2 V3 V4 V5 V6 hits.eq.5
1  1  2  3  4  5  6         6
2  1  2  3  4  5  7         1
3  1  2  3  4  6  7         1
4  1  2  3  5  6  7         1
5  1  2  4  5  6  7         1
6  1  3  4  5  6  7         1
7  2  3  4  5  6  7         1
This seems to do what I need, but it is too slow! I need to use this routine on large data frames (about 200K rows). I am currently using R 3.1.2 GUI 1.65 Mavericks build (6833). Can anyone provide a faster or more clever way of doing this? Thank you again. Best regards, Vaccaro
Using apply(...) on data frames is very inefficient. This is because apply(...) takes a matrix as argument, so if you pass a data frame it will coerce that to a matrix. In your example you convert df1 to a matrix every time you call apply(...), which is nrow(df2) times. Also, by using sapply(1:nrow(df2),...) and dfb[i,] you are using data frame row indexing, which is also very inefficient. You are much better off converting everything to matrix class at the beginning and then using apply(...) twice. Finally, there is no reason to use a call to c(...): apply(...) already returns a vector (in this case), so you are just incurring the overhead of another function call to no effect. Doing these things alone speeds up your code by about a factor of 20.
set.seed(1)
nrows <- 100
df1 <- data.frame(matrix(sample(1:5,5*nrows,replace=TRUE),nc=5))
df2 <- data.frame(matrix(sample(1:6,6*nrows,replace=TRUE),nc=6))

myfn <- function(dfa,dfb,hits) {
  sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
}

myfn.2 <- function(dfa,dfb,hits) {
  ma <- as.matrix(dfa)
  mb <- as.matrix(dfb)
  apply(mb,1,function(y) { sum(apply(ma,1,function(x) { sum(x %in% y) })==hits) })
}

system.time(r1 <- myfn(df1,df2,3))
#    user  system elapsed
#    1.99    0.00    2.00
system.time(r2 <- myfn.2(df1,df2,3))
#    user  system elapsed
#    0.09    0.00    0.10
identical(r1,r2)
# [1] TRUE
There is another approach which takes advantage of the fact that R is extremely efficient at manipulating lists. Since a data frame is just a list of vectors, we can improve performance by putting your rows into data frame columns and then using sapply(...) on that. This is faster than myfn.2(...) above, but only by about 20%.
myfn.3 <- function(dfa,dfb,hits) {
  df1.t <- data.frame(t(dfa))   # rows into columns
  df2.t <- data.frame(t(dfb))
  sapply(df2.t,function(col2)sum(sapply(df1.t,function(col1)sum(col1 %in% col2)==hits)))
}

library(microbenchmark)
microbenchmark(myfn.2(df1,df2,5),myfn.3(df1,df2,5),times=10)
# Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#  myfn.2(df1, df2, 5) 92.84713 94.06418 96.41835 98.44738 99.88179    10
#  myfn.3(df1, df2, 5) 75.53468 77.44348 79.24123 82.28033 84.12457    10
If you really have a dataset with 55MM rows, then I think you need to rethink this problem. I have no idea what you are trying to accomplish, but this seems like a brute force approach.
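For larger inputs, one further idea (a sketch, not part of the original answer) is to avoid %in% entirely and count shared values with an indicator-matrix product; this assumes the row entries are small positive integers that are distinct within each row, and note that the shared matrix is nrow(df1) x nrow(df2), so for very large data you would process df2 in blocks:
count_hits <- function(dfa, dfb, hits) {
  vals <- max(dfa, dfb)
  # 0/1 indicator matrices: one column per possible value, one row per data frame row
  inda <- t(apply(as.matrix(dfa), 1, function(x) tabulate(x, nbins = vals) > 0)) * 1
  indb <- t(apply(as.matrix(dfb), 1, function(x) tabulate(x, nbins = vals) > 0)) * 1
  shared <- inda %*% t(indb)   # shared[i, j] = number of values common to df1 row i and df2 row j
  colSums(shared == hits)      # per df2 row, how many df1 rows share exactly `hits` values
}
# with the combn example from the question:
count_hits(df1, df2, 5)
# [1] 6 1 1 1 1 1 1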
Computing difference between rows in a data frame
I have a data frame. I would like to compute how "far" each row is from a given row. Let us consider it for the 1st row. Let the data frame be as follows:
> sampleDF
X1 X2 X3
 1  5  5
 4  2  2
 2  9  1
 7  7  3
What I wish to do is the following:
Compute the difference between the 1st row & others: sampleDF[1,]-sampleDF[2,]
Consider only the absolute value: abs(sampleDF[1,]-sampleDF[2,])
Compute the sum of the newly formed data frame of differences: rowSums(newDF)
Now to do this for the whole data frame:
newDF <- sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})
This creates a problem in that the result is a transposed list. Hence,
newDF <- as.data.frame(t(sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})))
But another problem arises while computing rowSums:
> class(newDF)
[1] "data.frame"
> rowSums(newDF)
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be numeric
> newDF
  X1 X2 X3
1  3  3  3
2  1  4  4
3  6  2  2
>
Puzzle 1: Why do I get this error? I did notice that newDF[1,1] is a list & not a number. Is it because of that? How can I ensure that the result of the sapply & transpose is a simple data frame of numbers?
So I proceed to create a global data frame & modify it within the function:
sapply(2:4,function(x) { newDF <<- as.data.frame(rbind(newDF,abs(sampleDF[1,]-sampleDF[x,])));})
> newDF
  X1 X2 X3
2  3  3  3
3  1  4  4
4  6  2  2
> rowSums(outDF)
 2  3  4
 9  9 10
>
This is as expected.
Puzzle 2: Is there a cleaner way to achieve this? How can I do this for every row in the data frame (shown above is only the "distance" from row 1; I would need to do this for the other rows as well)? Is running a loop the only option?
To put it in words, you are trying to compute the Manhattan distance:
dist(sampleDF, method = "manhattan")
#    1  2  3
# 2  9
# 3  9 10
# 4 10  9  9
Regarding your implementation, I think the problem is that your inner function is returning a data.frame when it should return a numeric vector. Doing
return(unlist(abs(sampleDF[1,]-sampleDF[x,])))
should fix it.
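For Puzzle 2, a base-R sketch that avoids the sapply/transpose dance entirely: sweep out the reference row and take row sums (shown for row 1; the full pairwise table is just the dist matrix):
# Manhattan distance of every row from row 1, without a loop
rowSums(abs(sweep(as.matrix(sampleDF), 2, as.numeric(sampleDF[1, ]))))
#  1  2  3  4
#  0  9  9 10

# all pairwise distances at once, as a plain matrix
as.matrix(dist(sampleDF, method = "manhattan"))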