I have a large dataset for which I need to generate multiple cross-tables. Specifically, these are two-dimensional tables that report frequencies along with the mean and SD.
To give an example, I have the data below:
City <- c("A","B","A","A","B","C","D","A","D","C")
Q1 <- c("Agree","Agree","Agree","Agree","Agree","Neither","Neither","Disagree","Agree","Agree")
df <- data.frame(City,Q1)
Keeping the data in mind, I want to generate a cross-table with mean as below -
         City
             A    B    C    D
Agree        3    2    1    1
Neither                1    1
Disagree     1
Total        4    2    2    2
Mean       2.5    3  2.5  2.5
When generating the mean, Agree is given a weight of 3, Neither a weight of 2, and Disagree a weight of 1. The cross-table output should have the mean just below the Total row. It would be good to have gridlines between each column and row.
Can you please suggest how to achieve this in R?
Here's a possible solution using addmargins, which allows you to pass predefined functions to your table result:
wm <- function(x) sum(x * c(3, 1, 2)) / sum(x)
addmargins(table(df[2:1]), 1, list(list(Total = sum, Mean = wm)))
# City
# Q1 A B C D
# Agree 3.0 2.0 1.0 1.0
# Disagree 1.0 0.0 0.0 0.0
# Neither 0.0 0.0 1.0 1.0
# Total 4.0 2.0 2.0 2.0
# Mean 2.5 3.0 2.5 2.5
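If you want the SD too, you can add another function to the functions list in the same way. Note that a plain SD = sd would return the standard deviation of the cell counts in each column, not of the weighted responses, so the latter needs its own weighted function.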
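As a rough sketch of what a weighted SD could look like here: wsd below is a hypothetical helper (not part of base R) that treats the counts in each column as frequencies of the scores 3, 1 and 2, and uses the usual n - 1 denominator. It assumes the table rows are in alphabetical order (Agree, Disagree, Neither), as in the output above.

```r
City <- c("A","B","A","A","B","C","D","A","D","C")
Q1 <- c("Agree","Agree","Agree","Agree","Agree","Neither","Neither","Disagree","Agree","Agree")
df <- data.frame(City, Q1)

# Scores in the table's (alphabetical) row order: Agree, Disagree, Neither
scores <- c(3, 1, 2)

# Weighted mean of the scores, using the counts as frequencies
wm <- function(x) sum(x * scores) / sum(x)

# Hypothetical helper: sample SD of the scores, using the counts as frequencies
wsd <- function(x) sqrt(sum(x * (scores - wm(x))^2) / (sum(x) - 1))

tab <- table(df[2:1])
addmargins(tab, 1, list(list(Total = sum, Mean = wm, SD = wsd)))

# The same SDs on their own, one per city
sds <- apply(tab, 2, wsd)
```

For city A, for example, the responses are 3, 3, 3, 1, so the mean is 2.5 and the SD is 1.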
Here's a solution:
x <- table(df$Q1, df$City)  # basic cross-tab: rows = Q1, columns = City
# weights, named by response and matched to the table's row order
weights <- c("Agree" = 3, "Disagree" = 1, "Neither" = 2)[rownames(x)]
# weighted mean per city (column)
weightedmean <- apply(x, 2, function(counts) sum(counts * weights) / sum(counts))
# append the Total and Mean rows
x <- rbind(x,
           colSums(x),  # column totals
           weightedmean)
rownames(x)[4:5] <- c("Total", "Mean")
I'd like to sum rows two at a time, in order to study the lag of a certain variable.
Suppose that I have the following data:
> SE eggs
4 2.0
6 4.0
7 10.0
8 0.5
5 1.0
1 3.0
2 6.0
3 9.0
So, I expect to obtain the following, where eggs is the sum over the two SE indexes in each pair:
> df
SE2 eggs
"4+5" 3
"6+7" 14
"8+1" 3.5
"2+3" 15
Where
df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))
Note: the order of the rows in the data frame does not matter, but I need to start from a certain number (in this case, 4), pair it with the next number (in this case, 5), and keep following this logic: after that come SE 6+7, SE 8+1, SE 2+3...
Any hint on how can I do that?
I think I get the logic. You want ascending numbers starting from 4. When these numbers reach 8 (or whatever the maximum value of SE is), they wrap around back to one and continue to ascend until all the numbers are used up.
You then group these numbers into sequential pairs.
For each pair of numbers, you find the rows of your data frame with the matching values of SE. These rows contain the two values of eggs you wish to sum.
df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))
first <- 4
i <- match(df$SE, c(first:nrow(df), seq(first - 1)))
groups <- ((seq_along(i) + 1) %/% 2)[i]
do.call(rbind, lapply(split(df, groups), function(x) {
  data.frame(SE = paste(x$SE, collapse = "+"), eggs = sum(x$eggs))
}))
#> SE eggs
#> 1 4+5 3.0
#> 2 6+7 14.0
#> 3 8+1 3.5
#> 4 2+3 15.0
Created on 2020-02-17 by the reprex package (v0.3.0)
Match c(4:8, 1:3) to SE, use the match indexes to index into eggs, reshape the result into a 2x4 matrix and sum each column.
k <- 4 # starting index
nr <- nrow(df) # no of rows in df
with(df, colSums(matrix(eggs[match(c(k:nr, seq_len(k-1)), SE)], 2)))
## [1] 3.0 14.0 3.5 15.0
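If the pair labels are wanted alongside the sums, the same idea can be wrapped in a small function. This is only a sketch: pair_sums is a hypothetical helper name, and it assumes SE contains each of 1..nrow(df) exactly once and that the number of rows is even.

```r
df <- data.frame(SE = c(4, 6, 7, 8, 5, 1, 2, 3),
                 eggs = c(2, 4, 10, 0.5, 1, 3, 6, 9))

# pair_sums() is a hypothetical helper, not part of any package.
# Assumes SE holds each of 1..nrow(df) exactly once and nrow(df) is even.
pair_sums <- function(df, first) {
  n <- nrow(df)
  wrapped <- c(first:n, seq_len(first - 1))   # e.g. 4 5 6 7 8 1 2 3
  eggs <- df$eggs[match(wrapped, df$SE)]      # eggs in the wrapped order
  data.frame(
    # odd positions paired with even positions: "4+5", "6+7", ...
    SE2  = paste(wrapped[c(TRUE, FALSE)], wrapped[c(FALSE, TRUE)], sep = "+"),
    # each matrix column holds one pair of eggs values
    eggs = colSums(matrix(eggs, nrow = 2))
  )
}

res <- pair_sums(df, first = 4)
res
```

Changing first moves the starting point of the wrap-around without touching the rest of the logic.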
Another option, just a slight variation on my comment where we re-arrange the rows according to the specified logic and then aggregate every two rows:
aggregate(
eggs ~ ceiling(seq_along(SE)/2),
FUN = sum,
data = df[with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)), SE[!SE %in% seq(SE[1], max(SE))])))),]
)[, -1]
[1] 3.0 14.0 3.5 15.0
Or, if you'd like to keep the SE in the specified format:
df <- aggregate(
. ~ ceiling(seq_along(SE)/2),
FUN = paste, collapse = '+',
data = df[with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)), SE[!SE %in% seq(SE[1], max(SE))])))),]
)[, -1]
df$eggs <- sapply(df$eggs, function(x) eval(parse(text = x)))
Output:
df
SE eggs
1 4+5 3.0
2 6+7 14.0
3 8+1 3.5
4 2+3 15.0
I got column means and range(min, max) from my data.
df=matrix(c(3, 5, 2, 3, 6, 3,4, 4, 4, 5, 4, 3,5, 5, 5),ncol=3,byrow=TRUE)
colnames(df)<-paste0("ch", 1:ncol(df))
rownames(df)<-paste0("G", 1:nrow(df))
mean<- colMeans(df, na.rm = FALSE, dims = 1)
range<-apply(df, 2, range)
rownames(range) <- c("min","max")
res<-rbind(mean,range)
I have a standard mean value (4). Now I want to add an additional row of significance marks (**) to the existing output. Mean values less than 4 are considered significant. I managed to compute the marks, but I failed to add them to the existing result.
f<-res[1,] <4
test <- factor(f, labels=c("Ns", "**"))
result<-rbind(mean,range,test)
result
ch1 ch2 ch3
mean 4 4.8 3.4
min 3 4.0 2.0
max 5 6.0 5.0
test 1 1.0 2.0
I want the output to look like this:
ch1 ch2 ch3
mean 4 4.8 3.4
min 3 4.0 2.0
max 5 6.0 5.0
test Ns Ns **
I need your help to solve this.
rbind.data.frame(mean = mean, range, test = as.character(test))
# ch1 ch2 ch3
# mean 4 4.8 3.4
# min 3 4 2
# max 5 6 5
# test Ns Ns **
See ?rbind.data.frame for details.
I think a matrix can only store data of a single type. Here, the first three rows are numeric, but test is a factor, so it is coerced to numeric and Ns and ** are mapped to 1 and 2.
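A small illustration of that coercion, using just the mean row from the question (the values are copied from the output shown there; the object names coerced and as_text are made up for this sketch):

```r
m <- rbind(mean = c(ch1 = 4, ch2 = 4.8, ch3 = 3.4))
test <- factor(c(FALSE, FALSE, TRUE), levels = c(FALSE, TRUE),
               labels = c("Ns", "**"))

# rbind() on a numeric matrix and a factor keeps the factor's integer codes
coerced <- rbind(m, test)
coerced["test", ]

# Converting to character first keeps the labels,
# but then the whole matrix becomes character, numbers included
as_text <- rbind(m, test = as.character(test))
as_text
```

So a matrix either loses the labels or loses the numeric type, which is why a data frame is the more natural container here.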
I suggest using a data.frame instead.
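res <- rbind(mean, range)
res <- data.frame(t(res))   # one row per channel; columns mean, min, max
f <- res$mean < 4           # compare each channel's mean with the standard value
test <- factor(f, levels = c(FALSE, TRUE), labels = c("Ns", "**"))
res <- cbind(res, test)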
I hope this answer can help you!
I have a dataframe that looks like this
ID value
1 0.5
1 0.6
1 0.7
2 0.5
2 0.5
2 0.5
and I would like to add a column with normalization for values of the same ID like this: norm = value/max(values with same ID)
ID value norm
1 0.5 0.5/0.7
1 0.6 0.6/0.7
1 0.7 1
2 0.5 1
2 0.3 0.3/0.5
2 0.5 1
Is there an easy way to do this in R without first sorting and then looping?
Cheers
A solution using basic R tools:
data$norm <- with(data, value / ave(value, ID, FUN = max))
Function ave is pretty useful, and you may want to read ?ave.
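To see what ave is doing, here is a minimal sketch on the values from the expected-output table in the question (the intermediate name group_max is just for illustration):

```r
# Values taken from the question's expected-output table
data <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                   value = c(0.5, 0.6, 0.7, 0.5, 0.3, 0.5))

# ave() returns a vector as long as `value`,
# holding the maximum of each row's own ID group
group_max <- ave(data$value, data$ID, FUN = max)

# Dividing row by row gives the normalized column
data$norm <- data$value / group_max
data
```

Because ave returns one value per row rather than one per group, no sorting or merging is needed before the division.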
# Create an example data frame
dt <- read.csv(text = "ID, value
1, 0.5
1, 0.6
1, 0.7
2, 0.5
2, 0.5
2, 0.5")
# Load package
library(tidyverse)
# Create a new data frame with a column showing normalization
dt2 <- dt %>%
# Group the ID, make sure the following command works only in each group
group_by(ID) %>%
# Create the new column norm
# norm equals each value divided by the maximum value of each ID group
mutate(norm = value/max(value))
We can use data.table
library(data.table)
setDT(dt)[, norm := value/max(value), ID]
This is my data (imagine I have 1050 rows of data shown below)
ID_one ID_two parameterX
111 aaa 23
222 bbb 54
444 ccc 39
My code then will divide the rows into groups of 100 (there will be 10 groups of 100 rows).
I then want to get the summary statistics per group. (not working)
After that I want to place the summary statistics in a data frame to plot them.
For example, put all 10 means for parameterX together in a data frame, put all 10 standard deviations for parameterX in the same data frame, etc.
The following code is not working:
#assume data is available
dataframe_size <- nrow(thedata)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)
#splitdata into groups of 100
split_dataframe_into_groups <- function(x,y)
0:(x-1) %% y
list1 <- split(thedata, split_dataframe_into_groups(nrow(thedata), group_size))
#print data in the first group
list1[[1]]$parameterX
#NOT WORKING!!! #get summary stat for all 10 groups
# how to loop through all 10 groups?
list1_stat <- do.call(data.frame, list(mean = apply(list1[[1]]$parameterX, 2, mean),
sd = apply(list1[[1]]$parameterX, 2, sd). . .))
the error message is always:
Error in apply(...) dim(x) must have a positive length
That makes NO sense because when I run this code, There is clearly a positive length (data exists)
#print data in the first group
list1[[1]]$parameterX
#how to put all means in a dataframe?
# how to put all standard deviations in the same dataframe
ex df1 <- mean(2,2,3,4,7,2,4,,9,8,9),
sd (0.1, 3 , 0.5, . . .)
dplyr is so good for this kind of thing. If you create a new column that assigns a 'group' ID based on row location, then you can summarize each group very easily. I use an index to assist in assigning group IDs.
# install.packages('dplyr')  # run once if dplyr is not installed
library(dplyr)
## Create index
df$index <- 1:nrow(df)
## Assign group labels: rows 1-100 -> "Group 0", rows 101-200 -> "Group 1", ...
df$group <- paste("Group", (df$index - 1) %/% 100, sep = " ")
## Get summaries
df <- group_by(df, group)
summaries <- summarise(df, avg = mean(parameterX),
                       std_dev = sd(parameterX),
                       minimum = min(parameterX),
                       maximum = max(parameterX),
                       med = median(parameterX))
... and so on.
Hope this helps.
I think this might be a good place to use tapply. There is an excellent summary here! One path forward might be an extension of the code below:
df <- data.frame(id= c(rep("AA",10),rep("BB",10)), x=runif(20))
do.call("rbind", tapply(df$x, df$id, summary))
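Applied to the question's setup, that could look roughly like the sketch below. The data here is a made-up stand-in (1050 random values), and block, stats and stats_df are names chosen for the example. Two things are worth noting: the error in the question comes from calling apply() on a plain vector, which has no dim, and 0:(x-1) %% y assigns rows by remainder (interleaved), whereas integer division gives consecutive blocks of 100.

```r
# Hypothetical stand-in for the asker's data: 1050 rows, one numeric column
set.seed(1)
thedata <- data.frame(parameterX = rnorm(1050, mean = 40, sd = 10))

group_size <- 100
# Consecutive blocks of 100 rows: rows 1-100 -> 0, rows 101-200 -> 1, ...
block <- (seq_len(nrow(thedata)) - 1) %/% group_size

# tapply() hands each block to the function as a plain vector,
# so mean() and sd() are called directly, without apply()
stats <- do.call(rbind, tapply(thedata$parameterX, block, function(v) {
  c(mean = mean(v), sd = sd(v), n = length(v))
}))

# One row per group, one column per statistic, ready for plotting
stats_df <- data.frame(group = rownames(stats), stats, row.names = NULL)
```

With 1050 rows this gives eleven groups, the last one holding the remaining 50 rows.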
I think this is what you want:
require(dplyr)
dt<-rbind(iris,iris,iris)
dataframe_size <- nrow(dt)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)
df <- dt %>%
  # Creating the "bins" column using mutate
  mutate(bins = cut(seq_len(dataframe_size), breaks = number_ofgroups)) %>%
  # Aggregating the summary statistics by the bins variable
  group_by(bins) %>%
  # Calculating the mean
  summarise(mean.Sepal.Length = mean(Sepal.Length))
head(dt)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df
bins mean.Sepal.Length
(fctr) (dbl)
1 (0.551,113] 5.597345
2 (113,226] 5.755357
3 (226,338] 5.919643
4 (338,450] 6.100885
I have a data frame in which I want to find the reuse length of (x,y). Can someone suggest the quickest method to analyze it? For example:
df <- data.frame(
time=c(0,1,2,3,4,5,6),
x=c(1,4,2,1,6,1,4),
y=c(2,5,3,2,7,2,5)
)
I want the average or median of the re-occurrence of the same (x,y).
Here, (1,2) repeats at time 0, 3, 5. So average = ((3-0) + (5-3))/2 = 2.5
And average for (4,5) is 5.
So, overall average is 3.75.
Can someone suggest how to do this?
Thanks.
Perhaps you're looking for something like this:
out <- aggregate(time ~ x + y, df, function(blah) {
mean(diff(blah))
})
out
# x y time
# 1 1 2 2.5
# 2 2 3 NaN
# 3 4 5 5.0
# 4 6 7 NaN
sum(out$time, na.rm=TRUE)
# [1] 7.5
A data.table approach:
library(data.table)
DT <- data.table(df, key = "x,y")
DT[, mean(diff(time)), by = key(DT)][, sum(V1, na.rm=TRUE)]
# [1] 7.5
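Both snippets above sum the per-pair averages (7.5). The question asks for the overall average of 3.75, which is the mean of those per-pair averages; a minimal sketch, assuming the rows are already ordered by time (gaps and overall are names picked for this example):

```r
df <- data.frame(
  time = c(0, 1, 2, 3, 4, 5, 6),
  x    = c(1, 4, 2, 1, 6, 1, 4),
  y    = c(2, 5, 3, 2, 7, 2, 5)
)

# Mean gap between consecutive occurrences of each (x, y) pair;
# pairs seen only once give NaN because diff() returns an empty vector
gaps <- aggregate(time ~ x + y, df, function(t) mean(diff(t)))

# Average over the pairs that actually repeat (na.rm also drops NaN)
overall <- mean(gaps$time, na.rm = TRUE)
overall
```

Swapping mean for median in the last step gives the median reuse length instead.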