I have multiple .csv files in a folder. I would like to select every possible pair and make some calculations. Here are some example file names:
files <- c("/Users/st/Desktop/Form_Number_1.csv",
"/Users/st/Desktop/Form_Number_2.csv",
"/Users/st/Desktop/Form_Number_3.csv",
"/Users/st/Desktop/Form_Number_4.csv")
For each pair, I would like to merge them by id, calculate the correlation, and store the result.
So, manually:
library(readr)  # read_csv() comes from the readr package
dat1 <- read_csv("/Users/st/Desktop/Form_Number_1.csv")
dat2 <- read_csv("/Users/st/Desktop/Form_Number_2.csv")
dat.merge <- merge(dat1, dat2, by = "id")
correlation <- cor(dat.merge$score.x, dat.merge$score.y)
How can I do this at once?
combn is your friend here.
library(tidyverse)  # for map(), reduce(), read_csv() and the pipe
alldat <- map(files, read_csv)
combos <- combn(seq_along(alldat), 2)
This returns a matrix with 2 rows and choose(length(alldat), 2) columns (6 for four files), each column containing one unique pair of indices from 1..length(alldat). We next create a function that merges two data frames by id and calculates the correlation coefficient, and apply it to every column.
calc_func <- function(dat1, dat2) {
dat.merge <- merge(dat1, dat2, by = "id")
cor(dat.merge$score.x, dat.merge$score.y)
}
results <- apply(combos, 2, \(x) calc_func(alldat[[ x[1] ]],
alldat[[ x[2] ]]))
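To keep track of which correlation belongs to which pair of files, you can optionally name the results by their index pairs:
names(results) <- apply(combos, 2, paste, collapse = "-")  # "1-2", "1-3", ...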
That said, I am not a fan of this approach. If every file contains the same ids in the same order, it would be more elegant and efficient to simply extract the score column from each of the data frames and then calculate all correlation coefficients with one call to cor:
scores <- map(alldat, ~ .x$score) %>% reduce(cbind)
cor(scores)
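Optionally, label the columns of the score matrix with the file names so the correlation matrix is easier to read:
colnames(scores) <- basename(files)  # column names like "Form_Number_1.csv"
cor(scores)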
I looked at the variations of this question, but was not able to find an answer that worked... so here goes. I have a lot of data frames, each representing a psychological index (the kind where several questions are asked and the average of them all gives you a score on whatever is being measured: anger, anxiety, etc.). For this example, I will choose three of them: SA, SE, GT.
I would like to make a for loop to automatically calculate the average of the columns in each data frame, and then add a new column with that average.
I was able to make a for loop to do this for one data frame, but how do I then loop this loop to do it for all of my data frames (which is a lot more than 3)?
#This is the for loop to do it for just one data frame (SA)
avg <- c()
for(i in 1:nrow(SA)){
avg[i] <- sum(as.numeric(SA[i, ]), na.rm = TRUE) / ncol(SA)
}
SA$avg <- avg
#This is what I tried to do for multiple:
my.list <- list(SA, SE, GT)
for(l in my.list){
avg <- c()
for(i in 1:nrow(l)){
avg[i] <- sum(as.numeric(l[i, ]), na.rm = TRUE) / ncol(l)
}
l$avg <- avg
}
This may work for you. I've created some dummy data frames, assuming that you have the same number of observations for each psychological index. You then bash them all together into one big data frame, and the colMeans() function computes the mean of each column:
SA <- data.frame(SA=runif(10))
SE <- data.frame(SE=runif(10))
GT <- data.frame(GT=runif(10))
MP <- data.frame(MP=runif(10))
df <- cbind(SA, SE, GT, MP)
av <- colMeans(df, na.rm = TRUE)
If the indices have differing numbers of observations, you can combine them into a list as you did, and then use the function sapply(). Since each element of the list is a dataframe, you need to extract the actual column by using the index operator [, 1] (first column):
df <- list(SA, SE, GT, MP)
sapply(df, function(x) mean(x[,1], na.rm=TRUE))
UPDATE:
You can create a list of your dataframes again, but as you need means across rows, just use the rowMeans() function:
SA <- data.frame(matrix(runif(50), nrow=10))
SE <- data.frame(matrix(runif(80), nrow=10))
df <- list(SA, SE)
df <- lapply(df, function(x) { x$index_means <- rowMeans(x, na.rm = TRUE); x })
This will give you a list of data frames with a new column of means for each index.
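If you would rather end up with separate SA and SE objects again instead of a list, a sketch using base R's list2env() (after naming the list elements):
names(df) <- c("SA", "SE")
list2env(df, envir = .GlobalEnv)  # recreates SA and SE, each with the new column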
I have ten datasets, and each dataset contains "ratings" and "occupation" columns. From each of those ten datasets I want to find out the "average" of "ratings" per three occupation groups (i.e. artists, technician, marketing).
The code I have written is as follows:
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
occupation=as.character(),
average.rating=as.numeric(),
stringsAsFactors=FALSE)
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)
e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))
e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"technician",mean(e.1.technician$rating)))
e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"marketing",mean(e.1.marketing$rating)))
This is clearly not efficient at all, because for ten datasets, I have to rewrite the same code 9 more times to get the average ratings for each of those occupations groups for all of my ten datasets. Is there a better way to do this? I cannot think of anything better! I found out that apply/lapply can be a way to do this, but I could not figure out how they can work for my case.
Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the entire observations in each)
You can use the tidyverse package to summarize each of your data frames. First, you'll want to put them in a list. Then you can iterate over each of the data frames in the list, summarizing by occupation:
library(tidyverse)
# Create sample data
set.seed(2353)
sample_data <- rerun(10, tibble(
occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
ratings = sample(1:100, 100, replace = TRUE)
))
# Summarize by occupation
summarized_data <- sample_data %>%
map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
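If you also want the interval (dataset) number alongside each summary, bind_rows() with its .id argument flattens the list into a single data frame:
combined <- bind_rows(summarized_data, .id = "interval")  # interval is the list position, "1".."10"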
Another option, with base R. First load the files into a list, then use lapply() to calculate the group means for each dataset.
# Set the working directory to the folder that contains the files
files <- list.files()
# Load all the data at once into a single list
l <- lapply(files, dget)  # dget() assumes files written with dput(); use read.csv() for plain .csv files
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)
# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
d1 d2
artist 3.540984 3.612048
technician 3.519512 3.651106
marketing 3.147208 3.342569
Note that if your data are already all loaded, you could just put them into a list (rather than loading all the data directly into a list) and then use the lapply function and it should still work.
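For example, assuming the loaded datasets are named d1 through d10 (hypothetical names), mget() collects them into a named list:
l <- mget(paste0("d", 1:10))  # assumes objects d1 ... d10 exist in the workspace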
Edit
I just realized you only wanted the means for the three groups. I've edited the code above to subset all means to only the three groups.
I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset in "d1", and I can summarise it like so:
library(plyr)
ddply(d1, .(occupation), summarise, mean_rating = mean(rating))
This shows the results for all occupations, and you only wanted a specific three, so we can filter it to those:
ddply(subset(d1, occupation %in% c('artist','technician','marketing')),
      .(occupation), summarise, mean_rating = mean(rating))
Now we just need to generalize it to running over 10 datasets without cut and paste. Let's store our data frames inside a list:
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
Now we can run the same code on all of them, with lapply, and get a list back out:
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
ddply(subset(dataset,occupation %in% filtered_occupations),
.(occupation), summarise, mean_rating=mean(rating))} )
Result:
$d1
occupation mean_rating
1 artist 3.540984
2 marketing 3.147208
3 technician 3.519512
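If you would rather get a single data frame back instead of a list, plyr's ldply() runs the same function but row-binds the results, adding an .id column taken from the list names:
ldply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating = mean(rating))
})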
I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variables names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.
You do not need to use which with %in%; it already returns a logical vector that can be used to subset directly. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
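If you need to do this for MANY variables, a sketch that generalizes the same idea (just edit the vars vector):
vars <- c("G1", "G9")  # add as many variable names as needed
vals <- c(1, 2, 3)
keepies <- Reduce(`&`, lapply(df[vars], `%in%`, vals))
newdf <- df[keepies, ]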
Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df))  # long format: one row per (variable, value) pair
Then split into lists
DTLists <- split(DT, DT$variable)  # one list element per original column
Now you can operate on the lists recursively using lapply
DTresult <- lapply(DTLists, function(x) {
...
})
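That said, for the original filtering problem a more direct data.table sketch (assuming the variable names are collected in a character vector) would be:
library(data.table)
setDT(df)  # convert to data.table in place
vars <- c("G1", "G9")
newdf <- df[df[, Reduce(`&`, lapply(.SD, `%in%`, c(1, 2, 3))), .SDcols = vars]]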
I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.
datanew <- c()
for (i in 1:51){
for (k in 1:6){
for (m in 1:4){
sub <- subset(data,data$var1==i & data$var2==k)
sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
sub$newvar <- m
datanew <- rbind(datanew,sub)
}
}
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=100, nrow=20e2))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
multpVec <- c(rep(c(0, 1), times=c(4, ncol(mydf)-4-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
mapply(function(i, k)
# The function that you are `mapply`ing is:
# rbind'ing a list of dataframes, which were subsetted by matching var1 & var2
# and then multiplying by a value in filingstat
do.call(rbind,
# iterating over m
lapply(1:4, function(m)
# the cbind is for adding the newvar=m, at the end of the subtable
cbind(
# we transpose twice: first the subset to multiply our vector.
# Then the result, to get back our original form
t( t(subset(mydf, var1 == i & var2 == k)) *
(multpVec * filingstat0711[i,k,m] + invVec)),
# this is an argument to cbind
"newvar"=m)
)),
# the two lists you are passing as arguments are the columns of the expanded grid
ixk$i, ixk$k, SIMPLIFY=FALSE
)
# flatten the data frame
newdf <- do.call(rbind, newdf)
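As a quick sanity check with the sample data, the result should have four times as many rows as mydf plus one extra column:
dim(newdf)  # expect c(4 * nrow(mydf), ncol(mydf) + 1)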
Two points to note:
Try not to use names like data, table, df, or sub, which are the names of commonly used functions and objects; in the above code I used mydf in place of data.
You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation
I have a large data.frame, and I'd like to be able to reduce it by using a quantile subset by one of the variables. For example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame(x,rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100.,0.8))
What I would like to end up with is a data.frame that contains this quantile subset for each x = 1,2,3...10.
Is there a way to do this with ddply?
You could try:
library(plyr)
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x,y=rnorm(100)) to name a column on-the-fly.
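For comparison, the equivalent with dplyr would be a sketch like:
library(dplyr)
df %>%
  group_by(x) %>%
  filter(rnorm.100. > quantile(rnorm.100., 0.8)) %>%
  ungroup()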
Here's a different approach with the little-used ave() function (very fast to calculate this way).
Make a new column that contains the quantile calculation across each level of x
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function (x) quantile(x,0.8))
Select the items of the new column and the x column.
df2 <- unique(df[,c(1,3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
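And if what you actually want is the subset of rows above each group's cutoff (as in the ddply answer), the new column makes that a one-liner:
df3 <- df[df$rnorm.100. > df$quantByX, ]  # rows above the 0.8 quantile of their own group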