Collapsing Levels of a Factor Variable in one column while summing the counts in another - r

I originally had very wide data (4 rows with 158 columns), which I used reshape::melt() on to create a long data set (624 rows x 3 columns).
Now, however, I have a data set like this:
demo <- data.frame(region = as.factor(c("North", "South", "East", "West")),
                   criteria = as.factor(c("Writing_1_a", "Writing_2_a", "Writing_3_a", "Writing_4_a",
                                          "Writing_1_b", "Writing_2_b", "Writing_3_b", "Writing_4_b")),
                   counts = as.integer(c(18, 27, 99, 42, 36, 144, 99, 9)))
Which produces a table similar to the one below:
region criteria counts
North Writing_1_a 18
South Writing_2_a 27
East Writing_3_a 99
West Writing_4_a 42
North Writing_1_b 36
South Writing_2_b 144
East Writing_3_b 99
West Writing_4_b 9
Now what I want to create is something like this:
goal <- data.frame(region = as.factor(c("North", "South", "East", "West")),
                   criteria = as.factor(c("Writing_1", "Writing_2", "Writing_3", "Writing_4")),
                   counts = as.integer(c(54, 171, 198, 51)))
Meaning that when I collapse the criteria levels, the counts should be summed:
region criteria counts
North Writing_1 54
South Writing_2 171
East Writing_3 198
West Writing_4 51
I have tried using forcats::fct_collapse() and forcats::fct_recode(), but to no avail; I'm positive I'm just not doing it right. Thank you in advance for any assistance you can provide.

Think about what exactly you're trying to do to the factor levels: fct_collapse manually collapses several levels into one level, and fct_recode manually changes the labels of individual levels. What you're trying to do is change all the labels by applying some function, in which case fct_relabel is appropriate.
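(For contrast, a manual fct_collapse() call on this data would look like the sketch below; it works here, but every mapping has to be written out by hand, so it won't scale to many levels:

library(forcats)
# Manual collapse: each new level must list its old levels explicitly
demo$collapsed <- fct_collapse(demo$criteria,
                               Writing_1 = c("Writing_1_a", "Writing_1_b"),
                               Writing_2 = c("Writing_2_a", "Writing_2_b"),
                               Writing_3 = c("Writing_3_a", "Writing_3_b"),
                               Writing_4 = c("Writing_4_a", "Writing_4_b"))
)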
You can write out an anonymous function when you call fct_relabel, or just pass it the name of a function plus that function's argument(s). In this case, you can use stringr::str_remove to find and remove a regex pattern, with a regex such as _[a-z]$ to strip an underscore followed by a lowercase letter at the end of the string. That way it should scale well with your real data, but you can adjust it if not.
library(tidyverse)
...
new_crits <- demo %>%
  mutate(crit_no_digits = fct_relabel(criteria, str_remove, "_[a-z]$"))
new_crits
#> region criteria counts crit_no_digits
#> 1 North Writing_1_a 18 Writing_1
#> 2 South Writing_2_a 27 Writing_2
#> 3 East Writing_3_a 99 Writing_3
#> 4 West Writing_4_a 42 Writing_4
#> 5 North Writing_1_b 36 Writing_1
#> 6 South Writing_2_b 144 Writing_2
#> 7 East Writing_3_b 99 Writing_3
#> 8 West Writing_4_b 9 Writing_4
Verifying that this new variable has only the levels you want:
levels(new_crits$crit_no_digits)
#> [1] "Writing_1" "Writing_2" "Writing_3" "Writing_4"
And then summarizing based on that new factor:
new_crits %>%
group_by(crit_no_digits) %>%
summarise(counts = sum(counts))
#> # A tibble: 4 x 2
#> crit_no_digits counts
#> <fct> <int>
#> 1 Writing_1 54
#> 2 Writing_2 171
#> 3 Writing_3 198
#> 4 Writing_4 51
Created on 2018-11-04 by the reprex package (v0.2.1)
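If you also want to keep the region column, as in your goal data frame, group by both variables; a sketch following the same pattern:

new_crits %>%
  group_by(region, crit_no_digits) %>%
  summarise(counts = sum(counts))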

A dplyr solution using regular expressions:
demo %>%
  mutate(criteria = gsub("(_a)|(_b)", "", criteria)) %>%
  group_by(region, criteria) %>%
  summarize(counts = sum(counts)) %>%
  arrange(criteria) %>%
  as.data.frame
region criteria counts
1 North Writing_1 54
2 South Writing_2 171
3 East Writing_3 198
4 West Writing_4 51
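Note that "(_a)|(_b)" only strips those two particular suffixes. If the real data has other trailing letters, a pattern anchored to the end of the string generalizes better; a sketch using the same pipeline:

demo %>%
  mutate(criteria = sub("_[a-z]$", "", criteria)) %>%
  group_by(region, criteria) %>%
  summarize(counts = sum(counts)) %>%
  arrange(criteria) %>%
  as.data.frame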

Related

Find unique values in a column minus values that are in vector

I'd like to find the unique values of a column, but take away values that are in specified vectors. In the example data below I'd like to find the unique values from the column all_areas minus the values in the vectors area1 and area2.
i.e. the result should be "town", "city", "village"
set.seed(1)
area_df = data.frame(all_areas = sample(rep(c("foo", "bar", "big", "small", "town", "city", "village"), 5), 20),
                     number = sample(1:100, 20))
area1 = c("foo", "bar")
area2 = c("big", "small")
You could use the function setdiff to find the set difference between all_areas and area1 and area2 combined:
setdiff(area_df$all_areas, c(area1, area2))
[1] "city" "village" "town"
We can use %in% to create a logical vector, negate it (!) to keep the other elements of all_areas, and then return the unique rows with unique():
unique(subset(area_df, !all_areas %in% c(area1, area2)))
Output:
all_areas number
5 village 44
7 city 33
8 town 84
9 city 35
10 village 70
11 town 74
16 village 87
19 town 40
20 village 93
With a dplyr approach:
library(dplyr)
area_df %>%
  filter(!all_areas %in% c(area1, area2)) %>%
  distinct
#> all_areas number
#> 1 village 44
#> 2 city 33
#> 3 town 84
#> 4 city 35
#> 5 village 70
#> 6 town 74
#> 7 village 87
#> 8 town 40
#> 9 village 93
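And if you only want the unique area names themselves (the stated goal of "town", "city", "village") rather than whole rows, a sketch:

area_df %>%
  filter(!all_areas %in% c(area1, area2)) %>%
  distinct(all_areas) %>%
  pull(all_areas)
#> [1] "village" "city" "town"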

R combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City added to Dickinson, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
[Screenshot: code chunk output]
You could do:
cases_deaths %>%
  filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
  mutate(COUNTY = "Wayne") %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize_all(sum) %>%
  bind_rows(cases_deaths %>%
              filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
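A sketch of an alternative that avoids the split-and-rebind: relabel "Detroit City" as "Wayne" first, then aggregate everything in one pass (assuming the same COUNTY, CASE_STATUS, Cases, and Deaths columns as above):

cases_deaths %>%
  mutate(COUNTY = ifelse(COUNTY == "Detroit City", "Wayne", COUNTY)) %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize(Cases = sum(Cases), Deaths = sum(Deaths))

Counties other than Wayne are unaffected, since summing a single row returns it unchanged.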

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
  test.avg[i] <- mean(subset(test.df$activity, test.df$age >= age[i])[-i])
}
R returns a vector of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
  mutate(result = map_dbl(age, ~ mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index i refers to rows of the original data frame, but you apply [-i] to the already-subsetted vector, so it usually drops the wrong element.
Try something like this: first store the current row's age, then exclude the current row and average the activity of the remaining cases whose age is greater than or equal to it.
for (i in 1:10){
  test.avg[i] <- {amin = age[i]; mean(subset(test.df[-i, ], age >= amin)$activity)}
}
You can use map_df :
library(tidyverse)
test.df %>%
  mutate(map_df(1:nrow(test.df), ~
    test.df %>%
      filter(age >= test.df$age[.x]) %>%
      summarise(av_acti = mean(activity))))
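For the record, the solutions differ subtly: map_dbl with age > .x drops same-aged colonies entirely, the map_df version with >= includes the current row itself, and only the for-loop fix matches "the same age or older, excluding the current row". Wherever ages are tied (here, the two age-4 colonies) the results diverge. A sketch of that exact reading in the map_dbl style:

library(tidyverse)
test.df %>%
  mutate(result = map_dbl(seq_len(n()),
                          ~ mean(activity[-.x][age[-.x] >= age[.x]])))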

Is there an R function for including column into row data

I would like to perform a chi-square test in R by transforming a data frame (read from a CSV file) from the following structure
Observed Values East West North South
Males 50 142 131 70
Females 435 1523 1356 750
to
following example
Row Observed value Region
1 1 East
2 1 East
3 1 East
...
435 0 East
Given that 1 = male and 0 = female.
I have been trying to use the stack and data.frame functions to create the new table in R. I need this table to perform the chi-square test. The code I am trying is below:
Stacked_data <- stack(data)
library(dummies)
df1 <- data.frame(id = 1:0, Observed.Values)
df2 <- cbind(Stacked_data, dummy(df1$id, sep = "_"))
Expected result will contain two columns (observed value and region). Observed value will contain the categorical value for male = 1 and female = 0. Region will contain the region for the respective observed value.
So that when I perform
table(Region,Observed Values)
It will produce
Observed Values
Region 1 0
East 50 435
West 142 1523
North 131 1356
South 70 750
Update: based on your expected output, you don't need much at all. Using obs from below, all you need to get your output (on which you can run chisq.test) is:
obs2 <- t(obs[,-1])
dimnames(obs2) <- list(Region = rownames(obs2), Observed = c('1', '0'))
obs2
# Observed
# Region 1 0
# East 50 435
# West 142 1523
# North 131 1356
# South 70 750
But, then again, if all you need is to run a chisq.test on them, it doesn't matter which orientation you use:
### original frame you provided
chisq.test(as.matrix(obs[, -1]))
# Pearson's Chi-squared test
# data: as.matrix(obs[, -1])
# X-squared = 1.5959, df = 3, p-value = 0.6603
### transposed/re-labeled frame
chisq.test(obs2)
# Pearson's Chi-squared test
# data: obs2
# X-squared = 1.5959, df = 3, p-value = 0.6603
No difference. Perhaps all you needed was the [,-1] part?
Here's an attempt, though I don't know that it's exactly what you expect. (Input data is at the bottom of this answer.)
library(dplyr)
library(tidyr)
out1 <- obs %>%
  gather(Region, v, -Observed) %>%
  rowwise() %>%
  do(tibble(Region = .$Region, Observed = rep(1L * (.$Observed == "Males"), .$v))) %>%
  ungroup() %>%
  mutate(Row = row_number())
out1
# # A tibble: 4,457 x 3
# Region Observed Row
# <chr> <int> <int>
# 1 East 1 1
# 2 East 1 2
# 3 East 1 3
# 4 East 1 4
# 5 East 1 5
# 6 East 1 6
# 7 East 1 7
# 8 East 1 8
# 9 East 1 9
# 10 East 1 10
# # ... with 4,447 more rows
We can verify that it is reversible with
xtabs(~ Observed + Region, data = out1)
# Region
# Observed East North South West
# 0 435 1356 750 1523
# 1 50 131 70 142
(even if the columns and rows are in a different order as the input, the numbers match).
Data:
obs <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Observed East West North South
Males 50 142 131 70
Females 435 1523 1356 750 ")
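With newer tidyr (>= 1.0), the expansion to one row per observation can be written more compactly with pivot_longer() plus uncount(); a sketch that produces the same long frame as out1, minus the Row column:

library(dplyr)
library(tidyr)
obs %>%
  pivot_longer(-Observed, names_to = "Region", values_to = "n") %>%
  mutate(Observed = as.integer(Observed == "Males")) %>%
  uncount(n)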

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to pivot tables in Excel, and I am trying to replicate in R a pivot I have done in Excel. I have spent a long time searching the internet/YouTube, but I just can't get it to work.
I am looking to produce a table in which the left-hand column shows a number of locations, and across the top it shows the different pages that have been viewed. I want the table to show the number of views per location for each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard Usage.xlsx",
                                      sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by("Employee Team") %>%
This bit works, in that it filters entries for the month of October and groups the different team names. After this I have tried using the summarise function to summarise the number of hits, but I can't get it to work at all. I keep getting errors regarding data type, and I keep getting confused because the solutions I look up all use different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
Let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us, so I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard; you should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
  data.frame(
    month = sample(1:12, size = n, replace = TRUE),
    team = letters[1:5],
    value = sample(1:100, size = n, replace = TRUE)
  )
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
  filter(month == 10) %>%
  group_by(team) %>%
  summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
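As an aside, spread() has since been superseded in tidyr; with tidyr >= 1.0 the same long-to-wide reshape can be sketched with pivot_wider():

specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  pivot_wider(names_from = team, values_from = total_value)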
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
specificreportsLocal %>%
  filter(Month == "October") %>%
  rpivotTable(rows = "Employee Team", cols = "page", vals = "views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE))
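To lay that summary out like an Excel pivot (locations down the side, pages across the top), you could pivot it wide; a sketch, still assuming hypothetical page and views columns:

library(tidyr)
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE)) %>%
  pivot_wider(names_from = page, values_from = nr_views)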
