R: Aggregate with a yet undefined range of columns (including factors)

I am probably missing the right words to find my answer using the search function. I will have a dataset with a yet unknown number of columns, because the columns are produced by work in another program (MaxQDA), and later changes there will change the number of variables in the dataset. However, the dataset has a clear structure: 6 variables at the beginning (including code, a factor variable, and year, both mentioned below), followed from the 7th column onward by all the variables that come out of the work in MaxQDA.
So I would like a flexible way to refer to columns 7 to N in an aggregate call, to replace the dot in the following code, which to my understanding stands for all remaining columns.
dataset2 <- aggregate(. ~ code+jahr,
data = dataset,
sum,
na.action=na.pass
)
The suggestions from here do not help, because I don't know how to carry code+jahr over into the other suggested ways of writing the aggregate call.
Addendum: put differently, I want to exclude a few columns from the aggregate call while summing up a range of other columns.
Since there was confusion about vector types: I have some factor data such as id and Name. The data would look like this:
set.seed(42)
test2 <- as.data.frame(matrix(sample(16 * 4, replace=TRUE), ncol=16, nrow=4))
code <-c("aaa", "bbb","aaa", "ddd")
jahr <- c("1990", "1993", "2007", "2020")
id <- c("id1", "id2", "id3", "id4")
Name <- c("bla", "bla2", "bla3", "bla4")
test <- data.frame(code, jahr, id, Name)
dataset <- data.frame(test, test2)
dataset[1:4] <- lapply(dataset[, 1:4], as.factor)

Using the dataset above, we want to leave id and Name out of the aggregation, since they are factors that are not used to define groups. The simplest way to do that is to drop those columns from the data:
dataset2 <- aggregate(. ~ code+jahr, data = dataset[ , -(3:4)], sum, na.action=na.pass)
A slightly more complicated method is to build a logical vector that keeps the grouping columns and every numeric column, which drops the factors that are not used for grouping. The main advantage is that you do not have to figure out column numbers, and it is relatively simple to change the grouping variables:
keep <- colnames(dataset) %in% c("code", "jahr") | sapply(dataset, is.numeric)
dataset2 <- aggregate(. ~ code+jahr, data = dataset[, keep], sum, na.action=na.pass)
Both produce the same result.
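For the "columns 7 to N" part of the question, here is a minimal sketch that selects the value columns by position and passes the grouping columns separately, using the data frame method of aggregate instead of the formula. The small example data and the name first_value_col are assumptions for illustration; in the real dataset first_value_col would be 7.

```r
# Hypothetical example data: 4 leading columns, then the value columns
dataset <- data.frame(
  code = factor(c("aaa", "bbb", "aaa", "ddd")),
  jahr = factor(c("1990", "1993", "1990", "2020")),
  id   = factor(c("id1", "id2", "id3", "id4")),
  Name = factor(c("bla", "bla2", "bla3", "bla4")),
  V1   = c(1, 2, 3, 4),
  V2   = c(10, 20, 30, 40)
)

# Assumption: position of the first value column (7 in the real dataset)
first_value_col <- 5

# Everything from that column to the last one, however many there are
value_cols <- first_value_col:ncol(dataset)

# Sum the value columns within each code/jahr combination
dataset2 <- aggregate(dataset[value_cols],
                      by = dataset[c("code", "jahr")],
                      FUN = sum)
```

Because the range ends at ncol(dataset), the call does not need to change when MaxQDA adds or removes variables. Note that this form has no na.action argument; missing values simply propagate through sum unless na.rm = TRUE is passed.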

Related

R: Comparing different versions of data in terms of levels

My aim is to compare differences in the levels of variables that might occur across different versions of a dataset. In my code, I first convert everything to strings in order to be able to compare several kinds of variables (numeric, categorical, etc.). However, the code fails and does not give the desired result, which would be a data frame that consists of the variable and its possible differences (in a list). Any help is appreciated!
Thank you.
data1 <- lapply(?, as.character)
data2 <- lapply(?, as.character)
check_diffs <- function(vars, data1, data2) {
levels1 <- unique(data1$vars)
levels2 <- unique(data2$vars)
diff <- ifelse(length(union(setdiff(levels1,levels2), setdiff(levels2,levels1)))>0, list(union(setdiff(levels1,levels2), setdiff(levels2,levels1))), NA)
return(data.frame(var = vars, diffs = I(diff)))
}
diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = ?, data2 = ?))
The issue with the code was that vars gives a string, which must be called with get(vars, dataX). Then, the code gives the differences in coding between both data sets.
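A minimal sketch of that fix in base R, assuming two small made-up data frames. It uses data[[var]], which looks a column up by its name held in a string in the same way get(vars, dataX) does; the helper name level_diffs and the example data are hypothetical.

```r
# Hypothetical example data: two versions of the same dataset
data1 <- data.frame(sex = c("m", "f"), grp = c("a", "b"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(sex = c("m", "f", "x"), grp = c("a", "b", "b"),
                    stringsAsFactors = FALSE)

# Values of one variable that occur in only one of the two versions
level_diffs <- function(var, data1, data2) {
  levels1 <- unique(as.character(data1[[var]]))  # [[var]] uses the string
  levels2 <- unique(as.character(data2[[var]]))
  d <- union(setdiff(levels1, levels2), setdiff(levels2, levels1))
  if (length(d) > 0) d else NA
}

vars <- c("sex", "grp")
diff_list <- lapply(vars, level_diffs, data1 = data1, data2 = data2)

# One row per variable, differences stored in a list column
diffs_df <- data.frame(var = vars, diffs = I(diff_list))
```

The original data1$vars looks for a column literally called "vars", which is why it returned nothing useful.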

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see the variable names are the same in both dataframes, however, numeric values are slightly different whereas the correct version is in df2 but needs to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without using column references).
Here some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders are already matching, then you can achieve this really easily by the following subset notation:
df1[, c({column names df1})] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c, and you want to replace b and c with two columns of df2, whose columns are x, y, and z.
df1[, c("b", "c")] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument on the right-hand side is just added protection: it ensures that subsetting df2 returns a data.frame rather than a vector, even when only one column is selected. It is left off the left-hand side because data frame assignment does not take a drop argument.
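When the column names are the same in both data frames, as in the question, the list of names does not have to be typed out. A minimal sketch, assuming equal row counts and matching row order; the extra column other is made up to show that unrelated columns are left alone.

```r
# Example data in the spirit of the question, plus one unrelated column
df1 <- data.frame(var1 = 1:10, var2 = 1:10, other = letters[1:10])
df2 <- data.frame(var1 = 11:20, var2 = 11:20)

# Columns that exist in both data frames
common <- intersect(names(df1), names(df2))

# Guard the assumption that rows line up one to one
stopifnot(nrow(df1) == nrow(df2))

# Overwrite all shared columns in one assignment
df1[common] <- df2[common]
```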
If you do NOT know the order is correct or one data frame may have a differing size than the other BUT there is a unique identifier between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
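Note that a join adds the columns of df2 next to those of df1 rather than overwriting anything. If the goal is to overwrite a column in place by key, one base R option is match. This is only a sketch with made-up data, assuming a and x are unique identifiers:

```r
# Hypothetical data: same keys, different row order
df1 <- data.frame(a = c(3, 1, 2), b = c(0, 0, 0))
df2 <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# For each key in df1, the row position of the same key in df2
idx <- match(df1$a, df2$x)

# Overwrite b with the matching values of y
df1$b <- df2$y[idx]
```

Keys in df1 that have no partner in df2 end up as NA, so it is worth checking anyNA(idx) before overwriting.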

Repetitive Action Over Ten Matrices in R

I have ten datasets, and each dataset contains "ratings" and "occupation" columns. From each of those ten datasets I want to find out the "average" of "ratings" per three occupation groups (i.e. artists, technician, marketing).
The code I have written is as follows:
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
occupation=as.character(),
average.rating=as.numeric(),
stringsAsFactors=FALSE)
##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)
e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))
e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"technician",mean(e.1.technician$rating)))
e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
c(1,"marketing",mean(e.1.marketing$rating)))
This is clearly not efficient at all, because for ten datasets, I have to rewrite the same code 9 more times to get the average ratings for each of those occupations groups for all of my ten datasets. Is there a better way to do this? I cannot think of anything better! I found out that apply/lapply can be a way to do this, but I could not figure out how they can work for my case.
Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the entire observations in each)
You can use the tidyverse package to summarize each of your data frames. First, you'll want to put them in a list. Then you can iterate over each of the data frames in the list, summarizing by occupation:
library(tidyverse)
# Create sample data
set.seed(2353)
# rerun() is deprecated as of purrr 1.0.0; map() over 1:10 builds the same list
sample_data <- map(1:10, ~ tibble(
occupation = sample(c("Artist", "Technician", "Marketing"), 100, replace = TRUE),
ratings = sample(1:100, 100, replace = TRUE)
))
# Summarize by occupation
summarized_data <- sample_data %>%
map(~ .x %>% group_by(occupation) %>% summarize(avg_rating = mean(ratings)))
Another option, with base. First load the files into a list, then use lapply to calculate the means for each dataset
# Set the working directory to the folder that contains the files
files <- list.files()
# Load all the data at once into a single list
l <- lapply(files, dget)
names(l) <- substr(files, 1, 2) # gives meaningful names to list elements (datasets)
# Calculate the mean by group for each dataset
all_group_means <- lapply(l, function(x) tapply(x$rating, x$occupation, mean, na.rm = TRUE))
# Subset all the group means to just those you're interested in
sapply(all_group_means, function(x) x[c("artist", "technician", "marketing")])
                 d1       d2
artist     3.540984 3.612048
technician 3.519512 3.651106
marketing  3.147208 3.342569
Note that if your data are already all loaded, you could just put them into a list (rather than loading all the data directly into a list) and then use the lapply function, and it should still work.
Edit
I just realized you only wanted the means for the three groups. I've edited the code above to subset all means to only the three groups.
I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset in "d1", and I can summarise it like so:
library(plyr)
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))
This shows the results for all occupations, and you only wanted a specific three, so we can filter it to those:
ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))
Now we just need to generalize it to running over 10 datasets without cut and paste. Let's store our data frames inside a list:
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
Now we can run the same code on all of them, with lapply, and get a list back out:
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
ddply(subset(dataset,occupation %in% filtered_occupations),
.(occupation), summarise, mean_rating=mean(rating))} )
Result:
$d1
occupation mean_rating
1 artist 3.540984
2 marketing 3.147208
3 technician 3.519512
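If the goal is the single table from the question (interval, occupation, average.rating), the per-dataset results can be stacked with an interval number attached. A base R sketch with two tiny made-up datasets standing in for e.1 and e.2:

```r
# Hypothetical stand-ins for the real datasets e.1 and e.2
e.1 <- data.frame(occupation = c("artist", "artist", "technician", "marketing"),
                  rating = c(4, 2, 3, 5))
e.2 <- data.frame(occupation = c("artist", "technician", "technician", "marketing"),
                  rating = c(1, 2, 4, 3))

datasets <- list(e.1, e.2)  # list position doubles as the interval number

per_interval <- lapply(seq_along(datasets), function(i) {
  # Mean rating per occupation within dataset i
  res <- aggregate(rating ~ occupation, data = datasets[[i]], FUN = mean)
  names(res)[names(res) == "rating"] <- "average.rating"
  cbind(interval = i, res)
})

# Stack everything into one data frame
Average.Rating.per.Interval <- do.call(rbind, per_interval)
```

Keeping the result as a data frame rather than a matrix avoids the numeric columns being turned into character strings, which is what happens when numbers and occupation names share one matrix.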

R: Using dcast in a for loop to find the mean of multiple columns and compile them in a new dataframe

I have a dataframe (DF_Melted) which I obtained by melting some other dataset. The DF_Melted dataframe has columns "month","A","B","C","D","E","F". With the following code using dcast, I am able to get a dataframe which contains the mean of the variable for each combination of "A" and "month". This all works fine and as expected.
dcast_data<-dcast(DF_Melted,
month+A~variable,
fun.aggregate = mean)
Question-
Along the lines of the above code, I want to run a for loop to automatically obtain the dataset (using dcast) for the combinations month+A, month+B, month+C, month+D. I am unable to figure out how to substitute 'A' (or B, C, D) in a parametric manner.
I tried the following code where I reference to A,B,C,D as per their column number in DF_melted and it works:
for(j in seq(2,5, by=1)) #'A' is 2nd column, 'D' is 5th column
{
dcast_data<-dcast(DF_Melted,
month+DF_Melted[,j]~variable,
fun.aggregate = mean)
FinalDF<-cbind(FinalDF,dcast_data)
}
Although the above works, I am wondering if there is a smarter way to do the above without referencing the column number of the data frame?
Eventually my intention is to get a dataframe 'FinalDF' so that I could use it to plot the month v/s variable graph for each category of A, B, C, D. So doing this data reshaping automatically would be an immense help.
Consider a do.call(cbind, dfList) on a list of cast data frames:
dfList <- lapply(c("A", "B", "C", "D"), function(i) {
dcast(DF_Melted, month+DF_Melted[i]~variable,
fun.aggregate = mean)
})
FinalDF <- do.call(cbind, dfList)
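Another way to avoid column numbers is to build each casting formula from the column name as text. The sketch below only constructs the formulas in base R; the dcast call is left as a comment because it needs reshape2 and the real DF_Melted. The names group_cols and cast_formulas are made up for illustration.

```r
# Columns to combine with month, one at a time
group_cols <- c("A", "B", "C", "D")

# Build one formula per column, e.g. month + A ~ variable
cast_formulas <- lapply(group_cols, function(col) {
  as.formula(paste("month +", col, "~ variable"))
})
names(cast_formulas) <- group_cols

# Each formula could then be handed to dcast, for example:
# dfList <- lapply(cast_formulas, function(f) {
#   dcast(DF_Melted, f, fun.aggregate = mean)
# })
```

Keeping the results in a named list (dfList$A, dfList$B, ...) may be easier to plot from than one wide cbind result, since every cast data frame repeats the month column.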

Subsetting efficiently on multiple columns and rows

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variables names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.
You do not need which() with %in%; it already returns a logical vector. How about the following:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
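To extend this to MANY variables without writing one condition per column, the per-column tests can be generated with lapply and combined with Reduce. A minimal sketch with made-up data; vars is a hypothetical vector holding the column names to filter on.

```r
# Hypothetical example data
df <- data.frame(G1 = c(1, 2, 5, 3),
                 G9 = c(3, 4, 1, 2),
                 other = c("a", "b", "c", "d"))

vals <- c(1, 2, 3)    # allowed values
vars <- c("G1", "G9") # columns that must all take an allowed value

# One logical vector per column, combined row-wise with AND
keep <- Reduce(`&`, lapply(df[vars], function(col) col %in% vals))

newdf <- df[keep, ]
```

Changing the filter then only means editing vars or vals.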
Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df)) # melt needs a data.table here, so convert df first
Then split into lists
DTLists <- split(DT, list(DT[1:9])) #this is the number of columns that you have.
Now you can operate on the lists recursively using lapply
DTresult <- lapply(DTLists, function(x) {
...
})
