Using the GermanCredit dataset from the caret library.
library("caret")
data(GermanCredit)
After filtering this down a bit:
credit.all <- GermanCredit[,c(10, 1:9, 11:13, 16:19)]
attach(credit.all)
names(credit.all)
We have these names:
[1] "Class" "Duration"
[3] "Amount" "InstallmentRatePercentage"
[5] "ResidenceDuration" "Age"
[7] "NumberExistingCredits" "NumberPeopleMaintenance"
[9] "Telephone" "ForeignWorker"
[11] "CheckingAccountStatus.lt.0" "CheckingAccountStatus.0.to.200"
[13] "CheckingAccountStatus.gt.200" "CreditHistory.ThisBank.AllPaid"
[15] "CreditHistory.PaidDuly" "CreditHistory.Delay"
[17] "CreditHistory.Critical"
What I need to do is pivot and summarize on two of these columns, something I know how to do in SQL like this:
SELECT
Class
, SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) AS Critical
, SUM(CASE WHEN `CreditHistory.Critical` = 0 THEN 1 ELSE 0 END) AS NotCritical
, SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) / COUNT(*) AS PctCritical
FROM `credit.all`
GROUP BY
Class
Which would produce something like this:
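Roughly (the counts below match those shown in the dplyr answer further down):
Class | Critical | NotCritical | PctCritical
Bad   |       50 |         250 |      0.1667
Good  |      243 |         457 |      0.3471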
However, I am struggling mightily to get a foothold in R. Using books and Google, it seems I should use reshape2's melt and dcast to achieve something like this. What I have tried are basically variants of this:
library(reshape2)
credit.melted <- melt(credit.all[,c(1,17)], ID=c("name", "Class"))
dcast(credit.melted, Class~CreditHistory.Critical, nrow, fill=0)
But all of my attempts with these functions have produced errors too cryptic and too common to understand what I am doing wrong.
Error in vapply(indices, fun, .default) : values must be length 1,
but FUN(X[[1]]) result is length 0
Sometimes my random permutations of the function calls produce slightly different error output, but nothing that points me in the right direction.
Question: How can I do the pivoted summary similar to the SQL result using R?
I wouldn't really consider this a pivot; you're not using a pivot command in the SQL either. You can use dplyr to follow the exact same method as your SQL:
library(dplyr)
credit.all %>%
group_by(Class) %>%
summarize(Critical = sum(CreditHistory.Critical == 1),
NotCritical = sum(CreditHistory.Critical == 0),
PctCritical = mean(CreditHistory.Critical == 1))
# # A tibble: 2 x 4
# Class Critical NotCritical PctCritical
# <fct> <int> <int> <dbl>
# 1 Bad 50 250 0.167
# 2 Good 243 457 0.347
Since it's a binary column the == 1 isn't really necessary, but I leave it in because (a) it's more similar to your SQL code, and (b) if there were other values but you wanted the count of 1s, this would be the way to do it. However, you could get the same results a little more simply like this:
credit.all %>%
group_by(Class) %>%
summarize(Critical = sum(CreditHistory.Critical),
NotCritical = n() - Critical,
PctCritical = Critical / n())
If you really want a pivot approach, we can go that route; it just seems less straightforward. Your data is already in a long format, so we don't need to melt, we can just cast:
pivot = dcast(Class ~ CreditHistory.Critical, data = credit.all)
pivot
# Using CreditHistory.Critical as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# Class 0 1
# 1 Bad 250 50
# 2 Good 457 243
You could then rename the columns and calculate the percentages:
names(pivot)[2:3] = c("NotCritical", "Critical")
pivot$PctCritical = with(pivot, Critical / (Critical + NotCritical))
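With the renamed columns and the percentage added, pivot should line up with the dplyr result above, roughly:
#   Class NotCritical Critical PctCritical
# 1   Bad         250       50   0.1666667
# 2  Good         457      243   0.3471429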
I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
# group categories values
# 1 A Cat1 2200
# 2 A Cat2 4700
# 3 A Cat3 3000
# 4 A Cat4 2000
# 5 B Cat1 2900
# 6 B Cat2 4400
# 7 B Cat3 2200
# 8 B Cat4 1000
df_with_shares = df %>%
# Calculating group totals and adding them back to the main df
left_join(df %>%
group_by(group) %>%
summarize(group_total = sum(values)),
by='group') %>%
# Calculating each category's share within the groups
mutate(group_share = values / group_total,
group_share_rounded = round(group_share,2))
# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 0.99
# 2 B 1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
library(tidyr) # replace_na() used below comes from tidyr
df_with_rounded_shares = df %>%
mutate(
percs = values / sum(values),
percs_cumsum = cumsum(percs),
percs_cumsum_round = round(percs_cumsum, 2),
percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round,1),0),
percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like. It just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which need their rounded values to add up to 100% independently from every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.
The method devised in your implementation (from here) just needs a few small tweaks to make it work.
First, include a group_by statement before calculating the new columns. Also, you need to use a summarize statement instead of the mutate statement you have now.
In essence, this is what it'll look like:
# Modified version of your implementation of the rounding procedure.
# The new procedure below accommodates for grouping variables.
df_with_rounded_shares_by_group = df %>%
group_by(group) %>%
summarize(
group_share = values / sum(values),
group_share_cumsum = cumsum(group_share),
group_share_cumsum_round = round(group_share_cumsum, 2),
group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
# Removing unnecessary temporary columns
select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded_final))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 1
# 2 B 1
# Yep, they all add up to 100% as expected!
Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.
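A closely related variant, sketched here under the assumption that dplyr >= 1.0 and tidyr (for replace_na()) are loaded, keeps the original rows by doing the same cumulative rounding inside a grouped mutate() instead of the multi-row summarize():
df %>%
  group_by(group) %>%
  mutate(
    group_share = values / sum(values),
    cum_round   = round(cumsum(group_share), 2),
    group_share_rounded_final = cum_round - replace_na(lag(cum_round, 1), 0)
  ) %>%
  ungroup() %>%
  # drop the temporary cumulative column
  select(-cum_round)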
I'm currently struggling with a coding task concerning the use of a case_when statement in R.
In general, I would like to use the row index of the row currently being evaluated by case_when in the assignment part.
A short explanation of the data: I have a large data.frame with a date column, a geo-layer column, and some numeric columns holding the values for the calculations.
The data.frame is not sorted, and not every geo layer is necessarily present at every point in time. Sadly I can't provide a real data set due to legal issues.
The task at hand is to compute, on the one hand, simple mathematical operations within the same point in time and, on the other hand, mathematical operations across different points in time for the same geo layer and numeric value.
The mathematical operations vary, as does the interval between the time points.
For instance, I need to calculate the change rate of a value relative to the last quarter and to the last year:
((current_value - last_quarter_value) / current_value)*100
This is how I'd like to code it.
library(tidyverse)
test_dataframe <- data.frame(
times = c(rep(as.Date("2021-03-01"),2),rep(as.Date("2020-12-01"),2)),
geo_layer = rep(c("001001001", "001001002"),2),
numeric_value_a = 1:4,
numeric_value_b = 4:1,
numeric_value_c = c(1,NA,3,1)
)
check_comparison_times <- unique(test_dataframe$times)
test_dataframe <- test_dataframe %>%
mutate(
normal_calculation = case_when(
!is.na(numeric_value_c) ~ (numeric_value_a + numeric_value_b) / numeric_value_c,
TRUE ~ Inf
),
time_comparison = case_when(
is.na(numeric_value_c) ~ Inf,
(times - months(3)) %in% check_comparison_times ~ test_dataframe[
which(
test_dataframe[,"times"] ==
(test_dataframe[row_index_of_current_looked_at_row, "times"] - months(3)) &
test_dataframe[,"geo_layer"] ==
test_dataframe[row_index_of_current_looked_at_row, "geo_layer"]
)
,"numeric_value_c"] - test_dataframe[row_index_of_current_looked_at_row, "numeric_value_c"],
TRUE ~ -Inf
)
)
With this desired outcome:
times geo_layer numeric_value_a numeric_value_b numeric_value_c normal_calculation time_comparison
1 2021-03-01 001001001 1 4 1 5.000000 2
2 2021-03-01 001001002 2 3 NA Inf Inf
3 2020-12-01 001001001 3 2 3 1.666667 -Inf
4 2020-12-01 001001002 4 1 1 5.000000 -Inf
Currently I solve the problem with a triple loop in which I first pair the values by time, then by geo_layer, and then execute the mathematical operation.
Since my real data set is much larger than this example, that solution is very inefficient.
Thanks for your help.
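One join-based way to avoid the triple loop, sketched under the assumption that dplyr and lubridate also fit the real data; it reproduces the desired time_comparison column above:
library(dplyr)
library(lubridate)  # months()

# Look-up table of values keyed by (time, geo_layer)
quarter_lookup <- test_dataframe %>%
  select(times, geo_layer, prev_value_c = numeric_value_c)

test_dataframe %>%
  # date of the quarter we want to compare against
  mutate(comparison_time = times - months(3)) %>%
  # pull the previous quarter's value for the same geo_layer, if it exists
  left_join(quarter_lookup,
            by = c("comparison_time" = "times", "geo_layer")) %>%
  mutate(
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,   # current value missing
      is.na(prev_value_c)    ~ -Inf,  # no row for the previous quarter
      TRUE                   ~ prev_value_c - numeric_value_c
    )
  ) %>%
  select(-comparison_time, -prev_value_c)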
I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bear to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
(Image: my desired table format.) However, the table from the Secretary of State I have looks like the following:
(Image: gross, ugly raw results table.)
I wrote some pseudocode:
Here's the pseudocode; basically we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number column;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistent, so we need to account for uncontested races, libertarians, greens, and write-ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no Rep, no Dem, or no other party contested)
read through the single row 4 spaces; if it's null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in RStudio, and how? How can I read specific cells in a table, hopefully by indexing? Lastly, how do I write to a table in R? Any help is appreciated! Thank you!
Can I create a new CSV in RStudio, and how?
Yes, you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use brackets after the data frame; example below:
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table to an xlsx file, use the write.xlsx function from the openxlsx package.
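A minimal sketch, assuming the openxlsx package is installed:
library(openxlsx)
write.xlsx(df, file = "df.xlsx")  # writes the data frame to an Excel workbook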
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from Wikipedia and extract table.
The rvest package helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file. In this case I decided to look for the table that has the class name “wikitable plainrowheaders sortable”, as I learned from inspecting the code that the only table with that class is the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names; we can change that by using as_tibble() and its .name_repair argument. We then use dplyr::select() to get the columns. Furthermore, we use dplyr::filter() to delete the first two rows, which have "District" as a value in the District column. Now the columns are still character vectors, but we need them to be numeric, so we first delete commas from all columns and then transform columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally, we calculate the margin between Democratic and Republican votes by first adding up their votes as total_votes and then dividing the absolute difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third, and fourth columns. So what you want to do is:
(All the code below is not tested, as I do not have the original data.)
Read data into R
library(readr)
tx18 <- read_csv("filename.csv")
Select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
Clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X4)) # X4 (the fourth original column) is assumed to hold the vote counts
Pivot/reshape data to wide format
library(tidyr)
tx18 %>%
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
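A rough sketch of that final step, with hypothetical column names (they depend on the party labels that pivot_wider produces from the actual data):
# 'REP' and 'DEM' are placeholder column names; substitute the party labels
# that actually appear after pivot_wider().
tx18 %>%
  mutate(total_votes = REP + DEM,
         margin = abs(REP - DEM) / total_votes * 100)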
I am working with 3D motion-capture data. This means I have 3 columns (X,Y,Z) of joint coordinates for several joints in the body (e.g. the three columns describing the position of the left knee joint center are: LKX,LKY,LKZ).
My end goal is to plot at least 9 joint centers, and I believe the only way to achieve this is to transform my wide format dataframe into a long one.
As you can tell, I am trying to transform many sets of joint centers ending with either X, Y, or Z. Therefore, I try to use regular expressions within tidyr::extract, but I just can't get the code right.
df_wide <- data.frame(
ID = rep(1:2, each=10),
JN = rep(1:2, each=5),
Frame = rep(1:5, 4),
System = rep(1:2, 10),
RKX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
RKY = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
RKZ = rep(1:10+rnorm(10,mean=1,sd=0.5), 2),
LHeX = rep(1:10-rnorm(10,mean=1,sd=0.5),2),
LHeY = rep(1:10-rnorm(10,mean=1,sd=0.5),2),
LHeZ = rep(1:10-rnorm(10,mean=1,sd=0.5),2))
head(df_wide, 2)
ID JN Frame System RKX RKY RKZ LHeX LHeY LHeZ
1 1 1 1 1 1.332827 2.068720 2.295742 -0.02336031 -0.3011227 -1.212326
2 1 1 2 2 3.570076 3.306799 3.136177 2.08828231 1.9226740 2.106496
I wish to obtain this result:
ID JN Frame System joint X Y Z
1 1 1 1 1 RK 1.440103 2.221676 1.621871
2 1 1 1 1 LHe 3.537940 3.060948 2.856955
Here is my latest (of many) attempts. It has two problems: 1) extract only produces NA; 2) spread returns "Error: Duplicate identifiers for rows". I suspect this is related to the problem with extract.
df_3D <- df_wide %>%
gather(keys, values, -ID, -JN, -Frame, -System)%>%
extract(keys, c("X", "Y", "Z", "joint"), "(X$) (Y$) (Z$) ([A-Z].$)")%>%
spread(X, values)
I have found several good questions and answers regarding the transformation, but none of them specifically target the use of regular expressions.
Your approach is a little off. Each element of the keys column, once you've gathered, has the structure <Joint><Coord>, so you want something like:
df_wide %>%
gather(keys, values, -ID, -JN, -Frame, -System) %>%
extract(keys, c("Joint", "Coord"), "(.*)(X|Y|Z)$") %>%
spread(Coord, values)
The regex I've used here captures anything in the first group (since I don't know all the possible joint names), then X or Y or Z as the final character in the second group. There are lots of other regexes that would achieve the same thing.
Output:
ID JN Frame System Joint X Y Z
1 1 1 1 1 LHe 0.1344259 -0.2927277 0.05375166
2 1 1 1 1 RK 1.8083539 2.4053498 2.32899399
3 1 1 2 2 LHe 1.1777492 1.1780538 0.96549849
4 1 1 2 2 RK 3.2254236 2.4100235 2.79816371
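As an aside, if tidyr >= 1.0 is available, the same regex idea works in a single pivot_longer() call instead of gather/extract/spread; a sketch:
library(dplyr)
library(tidyr)

df_wide %>%
  pivot_longer(cols = matches("(X|Y|Z)$"),
               names_to = c("Joint", ".value"),
               names_pattern = "(.*)(X|Y|Z)$")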
You'll need to gather your data into a super long format, then split out the dimension, then spread THAT data back out into your X, Y, and Z columns:
library(tidyr)
library(stringr)
df2 <- df_wide %>%
# leave the other columns
gather( jointid, position, -ID, -JN, -Frame, -System ) %>%
# insert a separator to make it easier to split the X/Y/Z from the joint name
mutate(jointid = str_replace( jointid, "X|Y|Z", ";\\0")) %>%
# split the joint name and the dimension apart
tidyr::separate(jointid, c('joint', 'dim'), sep = ";" ) %>%
# spread the joint and position apart into 3 columns
spread(dim, position)
Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
missing <- sum(is.na(airquality[,Var]))
if (missing > 0) {
print(c(Var,missing))
}
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.
Just use sapply:
> sapply(airquality, function(x) sum(is.na(x)))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
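To report only the variables that actually contain missing values (per the edit in the question), you can subset that result:
na_counts <- sapply(airquality, function(x) sum(is.na(x)))
na_counts[na_counts > 0]
#   Ozone Solar.R
#      37       7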
You could also use apply or colSums on the matrix created by is.na()
> apply(is.na(airquality),2,sum)
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
> colSums(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
My new favourite for (not too wide) data is the set of methods from the excellent naniar package. Not only do you get frequencies, but also patterns of missingness:
library(naniar)
library(UpSetR)
riskfactors %>%
as_shadow_upset() %>%
upset()
It's often useful to see where the missing values fall in relation to the non-missing ones, which can be achieved by plotting a scatter plot that includes the missings:
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
Or for categorical variables:
gg_miss_fct(x = riskfactors, fct = marital)
These examples are from the package vignette, which lists other interesting visualizations.
We can use map_df with purrr.
library(mice)
library(purrr)
# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <int> <int> <int> <int>
# 1 37 7 0 0 0 0
summary(airquality)
already gives you this information
The VIM package also offers some nice missing-data plots for data.frames:
library("VIM")
aggr(airquality)
Another graphical alternative is the plot_missing function from the excellent DataExplorer package:
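A minimal call, assuming the DataExplorer package is installed:
library(DataExplorer)
plot_missing(airquality)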
The docs also point out that you can save these results for additional analysis with missing_data <- plot_missing(data).
More succinctly: sum(is.na(x[1]))
That is:
x[1] Look at the first column
is.na() true if it's NA
sum() TRUE is 1, FALSE is 0
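For example, with the airquality data used above:
sum(is.na(airquality[1]))
# [1] 37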
Another function that would help you look at missing data is df_status from the funModeling library:
library(funModeling)
iris.2 is the iris dataset with some added NAs. You can replace this with your dataset.
df_status(iris.2)
This will give you the number and percentage of NAs in each column.
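iris.2 is not created above; a hypothetical sketch of how such a copy could be built for testing:
set.seed(1)                       # for a reproducible example
iris.2 <- iris                    # start from the built-in iris data
iris.2[sample(nrow(iris.2), 10), "Sepal.Width"] <- NA  # sprinkle in some NAs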
For one more graphical solution, the visdat package offers vis_miss:
library(visdat)
vis_miss(airquality)
Very similar to the Amelia output, with the small difference that it gives percentages of missing values out of the box.
I think the Amelia library does a nice job of handling missing data and also includes a map for visualizing the missing rows.
install.packages("Amelia")
library(Amelia)
missmap(airquality)
You can also run the following code; it will return, for each row, a logical value indicating whether that row contains any NA:
row.has.na <- apply(training, 1, function(x){any(is.na(x))})
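A follow-up sketch, using the same hypothetical training data frame, showing how that logical vector is typically used:
sum(row.has.na)                               # number of rows with at least one NA
training.complete <- training[!row.has.na, ]  # drop those rows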
Another graphical and interactive way is to use the is.na10 function from the heatmaply library:
library(heatmaply)
heatmaply(is.na10(airquality), grid_gap = 1,
showticklabels = c(T,F),
k_col =3, k_row = 3,
margins = c(55, 30),
colors = c("grey80", "grey20"))
Probably won't work well with large datasets.
A dplyr solution to get the count could be:
summarise_all(df, ~sum(is.na(.)))
Or to get a percentage:
summarise_all(df, ~(sum(is.na(.)) / nrow(df)))
Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:
is_missing <- function(x){
missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}
# sample ugly data
df <- data.frame(a = c(NA, '1', ' ', 'missing'),
b = c(0, 2, NaN, 4),
c = c('NA', 'b', '-9', 'null'),
d = 1:4,
e = c(1, Inf, -Inf, 0))
# counts:
> summarise_all(df, ~sum(is_missing(.)))
a b c d e
1 3 1 3 0 2
# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
a b c d e
1 0.75 0.25 0.75 0 0.5
If you want to do it for a particular column, then you can also use this:
length(which(is.na(airquality[1])==T))
The ExPanDaR package's prepare_missing_values_graph function can be used to explore missingness in panel data:
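A sketch of the call; the argument names here are assumptions, so check ?prepare_missing_values_graph for the exact signature:
library(ExPanDaR)
# 'ts_id' is assumed to name the time identifier of the panel (Month is used here
# purely for illustration with the airquality data).
prepare_missing_values_graph(airquality, ts_id = "Month")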
For piping you could write:
# Counts
df %>% is.na() %>% colSums()
# % of missing rounded to 2 decimals
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))