I have ten datasets, and each dataset contains "ratings" and "occupation" columns. From each of those ten datasets I want to find the average of "ratings" for each of three occupation groups (i.e. artist, technician, marketing).
The code I have written is as follows:
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
                                          occupation=as.character(),
                                          average.rating=as.numeric(),
                                          stringsAsFactors=FALSE)
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)
e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))
e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"technician",mean(e.1.technician$rating)))
e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"marketing",mean(e.1.marketing$rating)))
This is clearly not efficient at all, because for ten datasets I would have to rewrite the same code nine more times to get the average ratings for each occupation group in all ten datasets. Is there a better way to do this? I cannot think of anything better! I found out that apply/lapply could be a way to do this, but I could not figure out how they would work for my case.
Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the observations in each.)
You can use the tidyverse package to summarize each of your data frames. First, you'll want to put them in a list. Then you can iterate over each of the data frames in the list, summarizing by occupation:
library(tidyverse)
# Create sample data
set.seed(2353)
sample_data <- rerun(10, tibble(
  occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
  ratings = sample(1:100, 100, replace = TRUE)
))
# Summarize by occupation
summarized_data <- sample_data %>%
  map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
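If you also want the dataset number alongside each mean, as in your original Average.Rating.per.Interval data frame, you can collapse the list of summaries into one data frame; the column name "interval" here is just chosen to match your question:
# Stack the per-dataset summaries; .id records each dataset's position in the list
combined <- bind_rows(summarized_data, .id = "interval")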
Another option, with base R. First load the files into a list, then use lapply to calculate the means for each dataset.
# Set the working directory to the folder that contains the files
files <- list.files()
# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)
# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
                  d1       d2
artist      3.540984 3.612048
technician  3.519512 3.651106
marketing   3.147208 3.342569
Note that if your data are already all loaded, you could just put them into a list (rather than loading all the data directly into a list) and then use lapply as above, and it should still work.
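For example, if the ten data frames already exist in your workspace as e.1 through e.10 (names assumed from your question), a minimal sketch would be:
# Collect the already-loaded data frames into a named list instead of reading files
l <- mget(paste0("e.", 1:10))
# The lapply/sapply code above then works unchanged on this list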
Edit
I just realized you only wanted the means for the three groups. I've edited the code above to subset all means to only the three groups.
I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset into "d1", and I can summarise it like so:
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))
This shows the results for all occupations, and you only wanted a specific three, so we can filter it to those:
ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))
Now we just need to generalize it to running over 10 datasets without cut and paste. Let's store our data frames inside a list:
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
Now we can run the same code on all of them, with lapply, and get a list back out:
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating=mean(rating))
})
Result:
$d1
occupation mean_rating
1 artist 3.540984
2 marketing 3.147208
3 technician 3.519512
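If you would rather end up with one combined data frame than a list, plyr's ldply() runs the same function over the list and stacks the results, adding an .id column with each dataset's name (a sketch reusing the objects defined above):
ldply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating=mean(rating))
})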
I have multiple .csv files in a folder. I would like to select every possible pair and make some calculations. Here is an example of the file names:
files <- c("/Users/st/Desktop/Form_Number_1.csv",
"/Users/st/Desktop/Form_Number_2.csv",
"/Users/st/Desktop/Form_Number_3.csv",
"/Users/st/Desktop/Form_Number_4.csv")
For each pair, I would like to merge them by id and calculate the correlation and store them.
so, manually,
dat1 <- read_csv("/Users/st/Desktop/Form_Number_1.csv")
dat2 <- read_csv("/Users/st/Desktop/Form_Number_2.csv")
dat.merge <- merge(dat1, dat2, by = "id")
correlation <- cor(dat.merge$score.x, dat.merge$score.y)
How can I do this at once?
combn is your friend here.
library(tidyverse)  # for map() and read_csv()
alldat <- map(files, read_csv)
combos <- combn(1:length(alldat), 2)
This returns a matrix with 2 rows and choose(length(alldat), 2) columns, each column containing one unique pair of indices from 1..length(alldat). We next create a function that calculates the correlation coefficient from two data frames, and apply it to every column of combos.
calc_func <- function(dat1, dat2) {
dat.merge <- merge(dat1, dat2, by = "id")
cor(dat.merge$score.x, dat.merge$score.y)
}
results <- apply(combos, 2, \(x) calc_func(alldat[[ x[1] ]],
                                           alldat[[ x[2] ]]))
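To keep track of which pair of files each correlation came from, you could optionally label the results with the index pairs:
# Name each correlation after the pair of file indices it was computed from
names(results) <- apply(combos, 2, paste, collapse = "-")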
That said, I am not a fan of this approach. It would be more elegant and efficient to simply extract the score column from each of the data frames and then calculate the correlation coefficients with one call to cor:
library(tidyverse)
scores <- map(alldat, ~ .x$score) %>% reduce(cbind)
cor(scores)
I'd like to use a loop function to recognise names from a list/dataframe as an actual list/dataframe name in the R script (for data analysis or manipulation).
I will create some pseudo data to try to help show what I'm trying to do.
Here is code to create three numeric vectors:
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
This code creates a character vector containing those variable names:
vars <- c("height","weight","income")
The code below doesn't run, but I would like to use a loop like this, where it takes the name from the vector position and uses it in the script as an object name. Thus it uses the name both to calculate the mean and to create a new object.
for (i in 1:3)
{mean_**vars[i]** = mean(**vars[i]**) }
The result should be 3 objects "mean_height", "mean_weight", "mean_income" which contain the mean scores
I'm not so much interested in the calculating of mean scores, I'm interested in the ability to use the names from the list. I want to be able to expand this to other analyses that are repetitive.
Apologies if the above hasn't been articulated too well; I'm quite new to R, so I hope it makes some sense.
Any help will be most useful, or if you can point me in the right direction that would be great.
This may be what you're looking for, where lapply applies the mean function to each of the items in vars (a list of dataframes). Note that you want to make the list of dataframes using the variable names.
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
vars <- list(height, weight, income)
lapply(vars, function(x) mean(x))
Then create an output dataframe using that:
df1 <- data.frame(lapply(vars, function(x) mean(x)))
colnames(df1) <- c("mean_height", "mean_weight", "mean_income")
df1
From your additional comment, using vars <- list(height, weight, income) should allow you to do this:
mean(height)
[1] 160.48
mean(vars[[1]])
[1] 160.48
This should work to output dynamically named variables:
vars <- list(height = height, weight = weight, income = income)
for (i in names(vars)){
  assign(paste("mean_", i, sep = ""), mean(vars[[i]]))
}
mean_height
[1] 163.28
mean_weight
[1] 90.465
mean_income
[1] 109686.5
However, I'd suggest not programming that way since it can cause issues and it's not very scalable. E.g., you could end up with 10000 variables.
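As a sketch of the more scalable alternative, a single named vector (built from the same named vars list defined above) keeps all the means in one object:
# One named vector of means instead of many mean_* objects
all_means <- sapply(vars, mean)
all_means["height"]  # look up any mean by name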
I guess what you want is something like below, which produces three objects in your global environment for the means of height, weight, and income from the list lst, i.e.,
list2env(setNames(Map(mean,lst),paste0("mean_",names(lst))),envir = .GlobalEnv)
DATA
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
lst <- list(height = height, weight = weight, income = income) # the list needs names for names(lst) above to work
A more common approach in R is to use lists of data, rather than separate variables.
Like this:
# make this reproducible
set.seed(123)
# make an empty list for the data
raw_data <- list()
# then fill the list. The data can be of varying length in a list.
raw_data$height <- sample(120:200,200,TRUE)
raw_data$weight <- sample(40:140,200,TRUE)
raw_data$income <- sample(20000:200000,200, TRUE)
Then looping becomes a one-liner and your names are preserved, using the *apply family of functions:
mean_data <- lapply(raw_data, mean)
# print that
mean_data
$height
[1] 159.06
$weight
[1] 90.83
$income
[1] 114000.7
Note what we didn't have to do:
know the number of variables.
have variables all the same length.
build a loop and keep track of names.
All handled automagically. Nice.
I looked at the variations of this question, but was not able to find an answer that worked... so here goes. I have a lot of data frames, each representing a psychological index (the kind where they ask several questions and the average of them all gives you a score on whatever it is you are measuring: anger, anxiety, etc.). For this example, I will choose three of them: SA, SE, GT.
I would like to make a for loop to automatically calculate the average of the columns in each data frame, and then add a new column with that average.
I was able to make a for loop to do this for one data frame, but how do I then loop this loop to do it for all of my data frames (of which there are a lot more than 3)?
#This is the for loop to do it for just one data frame (SA)
avg <- c()
for(i in 1:nrow(SA)){
  avg[i] <- (sum(as.numeric(SA[i,]), na.rm = T)/ncol(SA))
}
SA$avg <- avg
#This is what I tried to do for multiple:
my.list <- list(SA, SE, GT)
for(l in my.list){
  avg <- c()
  for(i in 1:nrow(l)){
    avg[i] <- (sum(as.numeric(l[i,]), na.rm = T)/ncol(l))
  }
  l$avg <- avg
}
This may work for you. I've created some dummy data frames, assuming that you have the same number of observations for each psychological index. You then bash them all together into one big dataframe. The colMeans function will compute means for each column:
SA <- data.frame(SA=runif(10))
SE <- data.frame(SE=runif(10))
GT <- data.frame(GT=runif(10))
MP <- data.frame(MP=runif(10))
df <- cbind(SA, SE, GT, MP)
av <- colMeans(df, na.rm = TRUE)
If the indices have differing numbers of observations, you can combine them into a list as you did, and then use the function sapply(). Since each element of the list is a dataframe, you need to extract the actual column by using the index operator [, 1] (first column):
df <- list(SA, SE, GT, MP)
sapply(df, function(x) mean(x[,1], na.rm=TRUE))
UPDATE:
You can create a list of your dataframes again, but as you need means across rows, just use the rowMeans() function:
SA <- data.frame(matrix(runif(50), nrow=10))
SE <- data.frame(matrix(runif(80), nrow=10))
df <- list(SA, SE)
lapply(df, function(x) {x$index_means <- rowMeans(x, na.rm=TRUE); return(x) })
This will give you a list of data frames with a new column of means for each index.
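If you name the list elements, those names carry through to the result, so each index stays identifiable afterwards (a small sketch):
df <- list(SA = SA, SE = SE)
res <- lapply(df, function(x) {x$index_means <- rowMeans(x, na.rm = TRUE); return(x)})
head(res$SA)  # the SA data frame with its new index_means column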
I have gotten instructions to do an analysis in R with the vegan package (concerning DCA's).
The instructions on a single dataframe are pretty straightforward, but I would like to apply the analysis on a set of dataframes.
I know this can be done with a for-loop or lapply or sapply, but I have trouble dealing with the fact that at each step of the analysis a new extension is added to the name of the data frame.
An example below
Say I have a dataframe DF, then it goes as follows:
DF.t1 <- decostand(DF, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
What I want to achieve is to run several dataframes through this analysis without writing it all out separately.
Let's say I have a set of dataframes called "DF_1", "DF_2" & "DF_3" which I want to do this analysis on.
I probably need to put the dataframes in a list, and get all the steps in a for-loop or one of the apply methods.
But how do I approach the problem with the extensions added (.ra, .t1, .t2, .t2.dca, .t2.dca.DW etc.) to the dataframe names?
Edit: I need to retain the original dataframes after the analysis, in order to do follow-up analysis on them.
Unless you have a very limited number of data frames, I would not advise defining ca. 8 new objects for each data frame in the global environment, as this can become very messy.
One approach you might consider is creating a nested list where the first level is the data frame and the second level are the modified data frames.
library(vegan)
# some example data sets
DF1 <- mtcars
DF2 <- mtcars*2
DF3 <- mtcars*3
all_dfs <- list(DF1 = DF1, DF2 = DF2, DF3 = DF3)
some_stuff <- function(df) {
  DF.t1 <- decostand(df, "total")
  DF.t2 <- decostand(DF.t1, "max")
  DF.t2.dca <- decorana(DF.t2)
  DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
  names(DF.t2.dca)
  summary(DF.t2.dca)
  DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
  DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
  DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
  return(list(DF.t1 = DF.t1, DF.t2 = DF.t2,
              DF.t2.dca = DF.t2.dca,
              DF.t2.dca.DW = DF.t2.dca.DW,
              DF.t2.dca.taxonscores = DF.t2.dca.taxonscores,
              DF.t2.dca.samplescores = DF.t2.dca.samplescores
  ))
}
nested_list <- lapply(all_dfs, some_stuff)
# To obtain any of the objects for a specific data.frame you could, for example, run
nested_list$DF1$DF.t2.dca.DW
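For the follow-up analyses you mention, you can also pull the same object out of every data frame's sub-list in one go, e.g. all the taxon scores (a sketch):
# Extract one component (here the taxon scores) from each data frame's result list
all_taxonscores <- lapply(nested_list, `[[`, "DF.t2.dca.taxonscores")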
Here is my problem. I have a dataset with 200k rows.
Each row corresponds to a test conducted on a subject.
Subjects have unequal number of tests.
Each test is dated.
I want to assign an index to each test. E.g. The first test of subject 1 would be 1, the second test of subject 1 would be 2. The first test of subject 2 would be 1 etc..
My strategy is to get a list of unique Subject IDs, use lapply to subset the dataset into a list of dataframes using the unique Subject IDs, with each Subject having his/her own dataframe with the tests conducted. Ideally I would then be able to sort each dataframe of each subject and assign an index for each test.
However, doing this over a 200k x 32 dataframe made my laptop (i5, Sandy Bridge, 4GB ram) run out of memory quite quickly.
I have 2 questions:
Is there a better way to do this?
If there is not, my only option to overcome the memory limit is to break my unique SubjectID list into smaller sets, say 1000 SubjectIDs per list, lapply each through the dataset, and join the results together at the end. In that case, how do I create a function to break my SubjectID list by supplying an integer that denotes the number of partitions? E.g. BreakPartition(Dataset, 5) would break the dataset into 5 equal partitions.
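(For reference, one minimal sketch of such a partition helper, splitting an ID vector into a given number of roughly equal chunks, could look like this; the name BreakPartition is taken from the example above.)
# Hypothetical helper: split a vector of subject IDs into k roughly equal groups
BreakPartition <- function(ids, k) {
  split(ids, cut(seq_along(ids), breaks = k, labels = FALSE))
}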
Here is code to generate some dummy data:
UniqueSubjectID <- sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse =""))
UniqueSubjectID <- subset(UniqueSubjectID, !duplicated(UniqueSubjectID))
Dataset <- data.frame(SubID = sample(sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse ="")),5000, replace = TRUE))
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), 5000, replace = TRUE)
Dataset <- cbind(Dataset, Dates)
I would guess that the splitting/lapply is what is using up the memory. You should consider a more vectorized approach. Starting with a slightly modified version of your example code:
n <- 200000
UniqueSubjectID <- replicate(500, paste(letters[sample(26, 5, replace=TRUE)], collapse =""))
UniqueSubjectID <- unique(UniqueSubjectID)
Dataset <- data.frame(SubID = sample(UniqueSubjectID , n, replace = TRUE))
Dataset$Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), n, replace = TRUE)
And assuming that what you want is an index counting the tests by date order by subject, you could do the following.
Dataset <- Dataset[order(Dataset$SubID, Dataset$Dates), ]
ids.rle <- rle(as.character(Dataset$SubID))
Dataset$SubIndex <- unlist(sapply(ids.rle$lengths, function(n) 1:n))
Now the 'SubIndex' column in 'Dataset' contains a by-subject numbered index of the tests. This takes a very small amount of memory and runs in a few seconds on my 4GB Core 2 duo Laptop.
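As an aside, base R's sequence() expands a vector of run lengths into exactly this kind of within-run index, so the last assignment above could be written more compactly:
# Equivalent to the unlist(sapply(...)) call above
Dataset$SubIndex <- sequence(ids.rle$lengths)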
This sounds like a job for the plyr package. I would add the index in this way:
require(plyr)
system.time(new_dat <- ddply(Dataset, .(SubID), function(dum) {
  dum = dum[order(dum$SubID, dum$Dates), ]
  mutate(dum, index = 1:nrow(dum))
}))
This splits the dataset up into chunks per SubID, and adds an index. The new object has all the SubID grouped together and sorted in time. Your example took about 2 seconds on my machine, and used almost no memory. I'm not sure how ddply scales to your data size and characteristics, but you could try. If this does not work fast enough, definitely take a look at the data.table package. A blog post of mine which compares (among others) ddply and data.table could serve as some inspiration.
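For scale, a rough data.table sketch of the same per-subject index (untested against your real data, and assuming the Dates column sorts chronologically) would be:
library(data.table)
DT <- as.data.table(Dataset)
setkey(DT, SubID, Dates)                 # sort by subject, then date
DT[, index := seq_len(.N), by = SubID]   # running test number within each subject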