Let's say I have a few columns in my data frame that come from a bunch of similar factors.
For example: A1_Factor1, A1_Factor2, A1_Factor3, B1_Factor1, B1_Factor2, C1_Factor1, etc.
What I want is to create additional columns from this data:
A1_Mean - This should be the average of columns starting with A1
B1_Mean - This should be the average of columns starting with B1
A1_Min - This should be the minimum value of columns starting with A1
B1_Min - This should be the minimum value of columns starting with B1
A1_SD - This should be the Standard Deviation of columns starting with A1
B1_SD - This should be the Standard Deviation of columns starting with B1
How can this be done in R, so that the code first extracts the columns that share a prefix, performs the required calculations on them, and then creates new columns named with the same prefix?
Thanks for your help in advance! :)
You can do this using the tidyverse packages.
Input:
library(tidyverse)
set.seed(123)
df <- tibble(A1_abc = sample(1:10, 5),
A1_cde = sample(10:15, 5),
B1_abc = sample(1:10, 5),
B1_cde = sample(15:20, 5))
df
# A tibble: 5 x 4
A1_abc A1_cde B1_abc B1_cde
<int> <int> <int> <int>
1 3 10 10 20
2 8 12 5 16
3 4 13 6 15
4 7 11 9 18
5 6 15 1 19
Method:
df %>%
gather(key, value) %>%
separate(key, c("gp", "rand"), sep = "_") %>%
select(-rand) %>%
group_by(gp) %>%
mutate(id = 1:n()) %>%
spread(gp, value) %>%
summarise_at(vars(2:3), funs(Min = min(.),
Max = max(.),
Mean = mean(.),
SD = sd(.)))
Output:
# A tibble: 1 x 8
A1_Min B1_Min A1_Max B1_Max A1_Mean B1_Mean A1_SD B1_SD
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3. 1. 15. 20. 8.90 11.9 3.96 6.61
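If you want to add more functions, add them inside funs() in the summarise_at() call. Note that funs() has been deprecated since dplyr 0.8.0; in current versions, pass a named list such as list(Min = min, Max = max) instead.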
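With current dplyr and tidyr, the same summary can be sketched without gather(), spread() or funs(). This is only one possible rewrite, not the answer's original code: the values are hard-coded from the tibble printed above, and the result has one row per prefix rather than a single wide row.

```r
library(dplyr)
library(tidyr)

# Values copied from the printed tibble above (no random sampling here)
df <- tibble(A1_abc = c(3, 8, 4, 7, 6),
             A1_cde = c(10, 12, 13, 11, 15),
             B1_abc = c(10, 5, 6, 9, 1),
             B1_cde = c(20, 16, 15, 18, 19))

res <- df %>%
  # Split names such as "A1_abc" into a prefix ("gp") and a suffix ("rand")
  pivot_longer(everything(), names_to = c("gp", "rand"), names_sep = "_") %>%
  group_by(gp) %>%
  # A named list of functions takes the place of funs()
  summarise(across(value, list(Min = min, Max = max, Mean = mean, SD = sd),
                   .names = "{.fn}"))
res
```

Adding another statistic then only means adding another named entry to the list, for example Median = median.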
I created a small example, and this is what I have:
df <- data.frame("A1_factor1" = rnorm(5), "A1_factor2" = rnorm(5),
"B1_factor1" = rnorm(5), "B1_factor2" = rnorm(5))
col.names <- names(df)
group <- unique(substr(col.names, 1, 2))
for (i in seq_along(group)) {
group.df <- df[, substr(names(df), 1, 2) == group[i]]
df[, ncol(df)+1] <- apply(group.df, 1, mean)
df[, ncol(df)+1] <- apply(group.df, 1, min)
df[, ncol(df)+1] <- apply(group.df, 1, sd)
df[, ncol(df)+1] <- apply(group.df, 1, max)
names(df)[(ncol(df)-3):ncol(df)] <- paste(group[i], c("Mean", "Min", "SD", "Max"), sep = "_")
}
df
I hope this helps!
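The loop above assumes every prefix is exactly two characters long. A variation that treats everything before the first underscore as the prefix might look like the sketch below. The small data frame, `orig.names` and the `stats` list are made up for illustration.

```r
# Made-up data with fixed values so the result is easy to check
df <- data.frame(A1_factor1 = c(1, 4), A1_factor2 = c(3, 8),
                 B1_factor1 = c(2, 2), B1_factor2 = c(6, 10))

orig.names <- names(df)                          # remember the input columns
prefixes <- unique(sub("_.*$", "", orig.names))  # "A1", "B1"
stats <- list(Mean = mean, Min = min, SD = sd, Max = max)

for (p in prefixes) {
  # Only the original columns that start with this prefix
  group.df <- df[orig.names[startsWith(orig.names, paste0(p, "_"))]]
  for (s in names(stats)) {
    # Row-wise statistic, stored under a name such as "A1_Mean"
    df[[paste(p, s, sep = "_")]] <- apply(group.df, 1, stats[[s]])
  }
}
df
```

Selecting from `orig.names` rather than `names(df)` keeps the newly created summary columns out of later calculations.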
I have a data.frame with 150 columns. For each column, I want to extract the maximum and minimum values (which may occur in more than one row) and the row names of each occurrence. I have extracted the min and max values into another data.frame but don't know how to match them back to their rows.
I have found functions that come close, for example for minimum values:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
sapply(cars,which.min)
speed dist
1 1
Here, it only gives the first index for minimum speed.
And I've tried with loops like:
for (i in (colnames(cars))){
print(min(cars[[i]]))
}
[1] 4
[1] 2
But that just gives me the minimum values. It does not tell me whether they are repeated, or the row name of each repeated value.
I want something like:
min.value column rowname freq.times
4 speed 1,2 2
2 dist 1 1
Thanks, and sorry for any spelling mistakes. I am not a native speaker.
One option is to use tidyverse. I was a little unclear if you want min and max in the same dataframe, so I included both. First, I create an index column with row numbers. Then, I pivot to long format to determine which values are minimum and maximum (using case_when). Then, I drop the rows that are not min or max (i.e., NA in category). Then, I use summarise to turn the row names into a single character string and get the frequency of a given minimum or maximum value.
library(tidyverse)
cars %>%
mutate(rowname = row_number()) %>%
pivot_longer(-rowname, names_to = "column", values_to = "value") %>%
group_by(column) %>%
mutate(category = case_when(value == min(value) ~ "min",
value == max(value) ~ "max")) %>%
drop_na(category) %>%
group_by(column, value, category) %>%
summarise(rowname = toString(rowname), freq.times = n()) %>%
select(2:3, 1, 4, 5)
Output
# A tibble: 4 × 5
# Groups: column, value [4]
value category column rowname freq.times
<dbl> <chr> <chr> <chr> <int>
1 2 min dist 1 1
2 120 max dist 49 1
3 4 min speed 1, 2 2
4 25 max speed 50 1
However, if you want to produce the data frames separately, you could adapt it like this. Here, I don't use category and instead use filter to drop all rows that are not the minimum for a given column. Then, we can summarise as we did above. You can do the same thing for max as well.
cars %>%
mutate(rowname = row_number()) %>%
pivot_longer(-rowname, names_to = "column", values_to = "min.value") %>%
group_by(column) %>%
filter(min.value == min(min.value)) %>%
group_by(column, min.value) %>%
summarise(rowname = toString(rowname), freq.times = n()) %>%
select(2, 1, 3, 4)
Output
# A tibble: 2 × 4
# Groups: column [2]
min.value column rowname freq.times
<dbl> <chr> <chr> <int>
1 2 dist 1 1
2 4 speed 1, 2 2
Here is another tidyverse approach:
which.min(.) gives only the first index, whereas which(. == min(.)) gives all indices for which the condition is true.
Analogously, to get the frequency we can use length(which(. == min(.))).
We summarise across all columns to get min.value, rowname and freq.times.
The part after that pivots the result to bring the column names into position.
library(tidyverse)
cars %>%
summarise(across(dplyr::everything(), list(min.value = min,
rowname = ~list(which(. == min(.))),
freq.times = ~length(which(.==min(.)))))) %>%
pivot_longer(
cols = contains("_"),
names_to = "key",
values_to = "val",
values_transform = list(val = as.character)
) %>%
separate(key, c("column", "name"), sep="_") %>%
pivot_wider(
names_from = name,
values_from = val
) %>%
mutate(rowname = str_replace(rowname, '\\:', '\\,'))
column min.value rowname freq.times
<chr> <chr> <chr> <chr>
1 speed 4 1,2 2
2 dist 2 1 1
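The difference between which.min() and which(x == min(x)) can be checked directly on a single column of the built-in cars data. This is a minimal illustration; the variable names are arbitrary.

```r
speed <- cars$speed

first_min <- which.min(speed)          # position of the first minimum only
all_min <- which(speed == min(speed))  # positions of every minimum
n_min <- length(all_min)               # how often the minimum occurs

first_min
all_min
n_min
```

For speed, the minimum of 4 sits in rows 1 and 2, so which.min() reports only row 1 while the which() form reports both.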
min.value <- sapply(cars, min)
columns <- names(min.value)
row.values <- sapply(columns, \(x) which(cars[[x]] == min.value[which(names(min.value) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(min.value) <- names(row.values) <- names(freq.times) <- NULL
data.frame(min.value = min.value,
columns = columns,
row.values = row.values,
freq.times = freq.times)
min.value columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
Here it is wrapped in a function, so that you can use it with whatever data frame and function you need:
create_table <- function(df, FUN) {
values <- sapply(df, FUN)
columns <- names(values)
row.values <- sapply(columns, \(x) which(df[[x]] == values[which(names(values) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(values) <- names(row.values) <- names(freq.times) <- NULL
data.frame(values = values,
columns = columns,
row.values = row.values,
freq.times = freq.times)
}
create_table(cars, min)
values columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
create_table(cars, max)
values columns row.values freq.times
1 25 speed 50 1
2 120 dist 49 1
You can use which to obtain the positions, and sapply should work. Since you need multiple summary statistics for each column, you just have to wrap them up in a list, something like this:
as.data.frame(sapply(cars, \(x) {
extrema <- range(x)
min.row <- which(x == extrema[[1L]])
max.row <- which(x == extrema[[2L]])
list(
min.value = extrema[[1L]], max.value = extrema[[2L]],
min.row = min.row, max.row = max.row,
freq.min = length(min.row), freq.max = length(max.row)
)
}))
Output
speed dist
min.value 4 2
max.value 25 120
min.row 1, 2 1
max.row 50 49
freq.min 2 1
freq.max 1 1
I have multiple objects with names depth_*
I want to summarize them like that:
depth_PATH2 %>%
summarise(avg = mean(V3), sd = sd(V3), med = median(V3))
which gives:
avg sd med
1 1 0 1
But I'd like to run a loop over all those objects so that I would get a giant table like:
avg sd med
depth_PATH2 1 0 1
depth_PGTH7 2 7 3
etc.
Can you help?
Thanks!
M
One approach is to use mget from base R to make a list of your data.frames.
You can then use bind_rows to combine them into one data.frame, group_by the object name, and summarise.
library(dplyr)
mget(ls(pattern="depth_")) %>%
bind_rows(.id = "obj") %>%
group_by(obj) %>%
summarise(avg = mean(V3), sd = sd(V3), med = median(V3))
## A tibble: 3 x 4
# obj avg sd med
# <chr> <dbl> <dbl> <dbl>
#1 depth_a 2 0 2
#2 depth_b 4.5 2.12 4.5
#3 depth_c 6 4.24 6
Sample Data
depth_a <- data.frame(A = c(1,2), V3 = c(2,2))
depth_b <- data.frame(A = c(1,2), V3 = c(6,3))
depth_c <- data.frame(A = c(1,2), V3 = c(9,3))
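If the object names should appear as row names, as in the table sketched in the question, a base R variation on the same mget idea is possible. This is a sketch using the sample data above; `objs` and `res` are illustrative names, and the result is a numeric matrix rather than a tibble.

```r
depth_a <- data.frame(A = c(1, 2), V3 = c(2, 2))
depth_b <- data.frame(A = c(1, 2), V3 = c(6, 3))
depth_c <- data.frame(A = c(1, 2), V3 = c(9, 3))

# Collect every object whose name starts with "depth_" into a named list
objs <- mget(ls(pattern = "^depth_"))

# One row of summary statistics per object; sapply() returns one column
# per object, so t() turns the objects into rows
res <- t(sapply(objs, function(d) {
  c(avg = mean(d$V3), sd = sd(d$V3), med = median(d$V3))
}))
res
```

Wrapping the result in as.data.frame() gives a data frame with the same row names.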
In my dataset, I have several variables like this:
Hypertension = 1,0,1,1,1,1,0,1
Diabetes = 1,1,0,0,1,1,0,1
Other NCD = 1,1,0,0,0,0,1,1
Here, 1 = yes and 0 = no.
Now I want to bind all of these "yes" responses from the above variables and create a multiple responses table like this -
SPSS has a function called "Multiple Response". This image is one of the outputs of this function.
How do I create this table?
Thanks in advance.
Please try this.
dat <- data.frame(
Hypertension = c(1,0,1,1,1,1,0,1),
Diabetes = c(1,1,0,0,1,1,0,1),
`Other NCD` = c(1,1,0,0,0,0,1,1),
check.names = FALSE
)
library(dplyr)
library(tidyr) # pivot_longer
dat %>%
tidyr::pivot_longer(everything(), names_to="k", values_to="v") %>%
group_by(k) %>%
summarize(
n = n(),
cases = sum(v),
percent = 100 * cases / n()
) %>%
ungroup() %>%
mutate(overall = 100 * cases / sum(n))
# # A tibble: 3 x 5
# k n cases percent overall
# <chr> <int> <dbl> <dbl> <dbl>
# 1 Diabetes 8 5 62.5 20.8
# 2 Hypertension 8 6 75 25
# 3 Other NCD 8 4 50 16.7
With base R, we can do
dat1 <- transform(stack(colSums(dat)), n = nrow(dat))
dat1$percent <- 100 *dat1$values/dat1$n
dat1$overall <- round(100 * dat1$values/sum(dat1$n), 2)
data
dat <- data.frame(
Hypertension = c(1,0,1,1,1,1,0,1),
Diabetes = c(1,1,0,0,1,1,0,1),
`Other NCD` = c(1,1,0,0,0,0,1,1),
check.names = FALSE
)
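The table pictured in the question is not reproduced here, so the exact layout wanted is an assumption. SPSS multiple response tables usually report each count as a percentage of all "yes" responses and as a percentage of cases. Under that assumption, a base R sketch could look like this; `n_yes`, `mr` and its column names are invented for illustration.

```r
dat <- data.frame(
  Hypertension = c(1, 0, 1, 1, 1, 1, 0, 1),
  Diabetes = c(1, 1, 0, 0, 1, 1, 0, 1),
  `Other NCD` = c(1, 1, 0, 0, 0, 0, 1, 1),
  check.names = FALSE
)

n_yes <- colSums(dat)  # number of "yes" answers per variable

mr <- data.frame(
  item = names(n_yes),
  n = unname(n_yes),
  # Share of all "yes" responses; these sum to 100
  pct_of_responses = 100 * unname(n_yes) / sum(n_yes),
  # Share of respondents; assumes every row counts as a valid case
  pct_of_cases = 100 * unname(n_yes) / nrow(dat)
)
mr
```

The percentages of cases can add up to more than 100 because one respondent may answer "yes" to several items.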
I have a data frame with, for example, replicates from an experiment in different columns. If each row in my data frame is a sample, with columns a, b, c as replicates, I want to:
Determine the variation between the replicates (what is the difference between the highest and lowest values in each row?) and put this in a new column called "variation".
If the variation is greater than 10, omit the one replicate that is furthest away.
How can I accomplish this in this data frame? I want new columns:
"max" - highest value of a, b, c for each row
"min" - lowest value of a, b, c for each row
"variation" - max/min for each row
Then, I want to omit the data points in a, b, or c that are furthest away from the others so the remaining points have <10 variation.
df <- data.frame(a = rnorm(10, 100, 20),
b = rnorm(10, 2000, 500),
c = rnorm(10, 50, 20))
df$max = apply(df, 1, max, na.rm = TRUE)
df$min = apply(df, 1, min, na.rm = TRUE)
df$variation = df$max/df$min
(Also, how can I calculate the max and min using dplyr and %>% notation?)
Example using dplyr pipes, with mutate and group_by. I reshaped the data in long format using tidyr gather and reshaped it in wide format at the end using spread.
library(dplyr)
library(tidyr)
set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
b = rnorm(10, 2000, 500),
c = rnorm(10, 50, 20))
Reshape the data to long format and group by id (the row number in the wide format). Then compute the variation and the distance from the median value.
dtf <- dtf_wide %>%
# Explicitly add an identification column (for the grouping)
mutate(id = row_number()) %>%
# put data in tidy format, one observation per row
gather(key, value, a:c) %>%
arrange(id) %>%
group_by(id) %>%
mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE),
median = median(value),
distancefrommedian = abs(value-median),
maxdistancefrommedian = max(distancefrommedian))
head(dtf)
# # A tibble: 6 x 7
# # Groups: id [2]
# id key value variation median distancefrommedian maxdistancefrommedian
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 a 89.95615 49.58856 89.95615 0.00000 1954.987
# 2 1 b 2044.94307 49.58856 89.95615 1954.98692 1954.987
# 3 1 c 41.23820 49.58856 89.95615 48.71795 1954.987
# 4 2 a 102.63062 31.37407 102.63062 0.00000 1945.507
# 5 2 b 2048.13723 31.37407 102.63062 1945.50661 1945.507
# 6 2 c 65.28121 31.37407 102.63062 37.34941 1945.507
If variation is greater than 10, remove the row whose value is furthest from the median (you could change that rule here to remove more rows if needed).
dtf <- dtf %>%
# For each id,
# keep all rows where variation is at most 10
filter(variation <= 10 |
# If variation is greater than 10,
# filter out the row where the value is furthest from the median
(variation > 10 & distancefrommedian < maxdistancefrommedian)) %>%
# Keep only interesting variables
select(id, key, value) %>%
# Compute the variations again (just to check)
mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE))
head(dtf)
# id key value variation
# <int> <chr> <dbl> <dbl>
# 1 1 a 89.95615 2.181379
# 2 1 c 41.23820 2.181379
# 3 2 a 102.63062 1.572131
# 4 2 c 65.28121 1.572131
# 5 3 a 98.42166 1.781735
# 6 3 c 55.23923 1.781735
Reshape data to obtain a table in wide format similar to the original data frame.
dtf_wide2 <- dtf %>%
spread(key, value)
head(dtf_wide2)
# id variation a c
# <int> <dbl> <dbl> <dbl>
# 1 1 4.385692 89.95615 41.23820
# 2 2 4.385692 102.63062 65.28121
# 3 3 4.385692 98.42166 55.23923
# 4 4 4.385692 117.73570 65.46809
# 5 5 4.385692 102.33943 33.71242
# 6 6 4.385692 106.37260 41.23099
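For the side question about computing the row-wise max and min with dplyr while staying in wide format, one option is pmax() and pmin() inside mutate(). The sketch below uses small made-up numbers instead of rnorm() so the result is easy to verify; `df2` is an illustrative name.

```r
library(dplyr)

# Made-up replicate values (not the random data from the question)
df <- data.frame(a = c(100, 90), b = c(2000, 95), c = c(50, 92))

df2 <- df %>%
  mutate(max = pmax(a, b, c, na.rm = TRUE),  # element-wise (row-wise) maximum
         min = pmin(a, b, c, na.rm = TRUE),  # element-wise (row-wise) minimum
         variation = max / min)              # uses the two columns just created
df2
```

pmax() and pmin() are vectorised over rows, so no grouping or reshaping is needed for this step.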
I have person-level data and want to create a new variable that has the number of kids in a family. I have created a dummy variable for kids (1 if age<18, 0 otherwise). I'm currently using the aggregate function, where HH_ID is a household identifier.
No_kids <- aggregate(child ~ HH_ID, data = df, sum)
This code works but the data frame collapses whereas I want to assign the number of kids to each observation for that household. Is there an alternative to the aggregate function that doesn't collapse the data set?
Another option is dplyr, of course.
library(dplyr)
player_df = data.frame(team = c('ARI', 'BAL', 'BAL', 'CLE', 'CLE'),
player =c('A', 'B', 'C', 'D', 'F'),
'1' = floor(runif(5, min=1, max=2)*10),
'2' = floor(runif(5, min=1, max=2)*10))
Then use group_by and mutate from dplyr:
player_df %>% group_by(team) %>% mutate(count = n())
Source: local data frame [5 x 5]
Groups: team [3]
team player X1 X2 count
<fctr> <fctr> <dbl> <dbl> <int>
1 ARI A 12 12 1
2 BAL B 10 12 2
3 BAL C 14 12 2
4 CLE D 10 14 2
5 CLE F 18 17 2
Alternatively, you could do a merge after aggregation (so in base R):
ag <- aggregate(child ~ HH_ID, data = df, sum)
setNames(merge(df, ag, by="HH_ID"), c("HH_ID", "child", "No_kids"))
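Base R also has ave(), which applies a function within groups and returns a vector as long as the input, so nothing is collapsed. A minimal sketch with made-up household data, reusing the HH_ID and child column names from the question:

```r
# Made-up example: three households, six people
df <- data.frame(HH_ID = c(1, 1, 2, 2, 2, 3),
                 child = c(1, 0, 1, 1, 0, 0))

# Sum of the child dummy within each household, repeated for every member
df$No_kids <- ave(df$child, df$HH_ID, FUN = sum)
df
```

Every person in household 2 gets No_kids = 2, and the data frame keeps all six rows.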
Using the dplyr package:
# Create sample data
set.seed(3252)
df <- data.frame(
HH_ID = sample(1:10, 50, replace = TRUE),
child = sample(0:1, 50, replace = TRUE)
)
# Count number of children
df %>%
group_by(HH_ID) %>%
mutate(child_count = sum(child)) %>%
ungroup()