I have a dataset that looks like this:
Starting Dataset
Code used to create the Starting dataset:
dataset<-data.frame(Attorney=c("John Doe", "Client #1","274", "296",
"297", "Client #2", "633", "Jane Doe",
"Client #1", "309", "323"),
Date=c(NA, NA, "2019/4/4", "2019/4/4", "2019/4/12",
NA, " 2019/2/3", NA, NA, "2019/12/1", "2019/12/4"),
Code=c(NA, NA, "7NP/7NP", "1UE/1UE", "2C1/2C1",NA,
"7NP/7NP", NA, NA, "7NP/7NP", "7FU/7FU"),
Billed_Amount=c(NA, NA, 1200.00, 4000.00, 2775.00,
NA, 1200.00, NA, NA, 1200.00, 385),
Amount= c(NA, NA, "1200", "4000", "2775", NA, "1200",
NA, NA, "1200", "385"),
Current =c(NA, NA, 0, 0, 0, NA, 0, NA, NA, 0, 0),
X.120=c(NA, NA, "1200", "4000", "2775", NA, "1200",
NA, NA, "1200", "385"))
My goal is to end up with a dataset that looks like:
Goal Dataset
Code used to create Goal dataset:
dataset<-data.frame(Attorney=c("John Doe", "John Doe", "John Doe",
"John Doe", "Jane Jane", "Jane Jane"),
Date=c("2019/4/4", "2019/4/4", "2019/12/4", " 2019/2/3",
"2019/12/1","2019/12/4" ),
Code=c("7NP/7NP", "1UE/1UE","2C1/2C1", "7NP/7NP",
"7NP/7NP", "7FU/7FU"),
Billed_Amount=c(1200.00, 4000.00,2775.00, 1200.00,
1200.00, 385),
Amount= c(1200, 4000, 2775, 1200,1200, 385),
Current= c(0, 0, 0, 0, 0, 0),
X.120=c(1200, 4000, 2775,1200, 1200, 385))
I want to rename the rows underneath each attorney with the attorney's name, without worrying about preserving the client's name. My original dataset has a number of attorneys, each with a varying number of clients, and those clients have varying numbers of codes, dates, and amounts associated with them.
I tried to use an if/else statement but encountered an error message.
I appreciate any help you can give me. Thanks!
Edit: I have edited my question to include hypothetical attorney names.
An option is to create a grouping variable based on which rows of the 'Attorney' column hold an attorney name (unlike the client labels and case numbers, the names contain no digits), then mutate the 'Attorney' column to the first element of each group, and filter out the NA rows.
library(dplyr)
library(stringr)
dataset %>%
group_by(grp = cumsum(!str_detect(Attorney, "\\d"))) %>%
mutate(Attorney = first(Attorney)) %>%
filter_at(vars(Date:X.120), all_vars(!is.na(.))) %>%
ungroup %>%
select(-grp)
We can also use na.omit here
dataset %>%
group_by(grp = cumsum(!str_detect(Attorney, "\\d"))) %>%
mutate(Attorney = first(Attorney)) %>%
ungroup %>%
select(-grp) %>%
na.omit
# A tibble: 6 x 7
#  Attorney Date        Code    Billed_Amount Amount Current X.120
#  <fct>    <fct>       <fct>           <dbl> <fct>    <dbl> <fct>
#1 John Doe "2019/4/4"  7NP/7NP          1200 1200         0 1200
#2 John Doe "2019/4/4"  1UE/1UE          4000 4000         0 4000
#3 John Doe "2019/4/12" 2C1/2C1          2775 2775         0 2775
#4 John Doe " 2019/2/3" 7NP/7NP          1200 1200         0 1200
#5 Jane Doe "2019/12/1" 7NP/7NP          1200 1200         0 1200
#6 Jane Doe "2019/12/4" 7FU/7FU           385 385          0 385
Or another option is to fill the 'Attorney' column after replacing the non-attorney entries (those containing digits) with NA, so that each row gets filled with the previous non-NA attorney name, then do na.omit
library(tidyr)
dataset %>%
mutate(Attorney = replace(Attorney, str_detect(Attorney, "\\d"), NA)) %>%
fill(Attorney) %>%
na.omit
Base R solution (using #akrun's logic):
data.frame(do.call("rbind",
lapply(split(dataset, cumsum(!(grepl("\\d+", dataset$Attorney)))),
function(x){
non_att_cols <- names(x)[names(x) != "Attorney"]
y <- data.frame(na.omit(x[,non_att_cols]))
y$Attorney <- x$Attorney[1]
return(y[,c("Attorney", non_att_cols)])
}
)
),
row.names = NULL
)
I have selected a few columns within the data set and want to make a table using gtsummary. I have come across some issues and am not sure how to make it work.
Part of the reproducible data is here:
structure(list(country = c("SGP", "JPN", "THA", "CHN", "JPN",
"CHN", "CHN", "JPN", "JPN", "JPN"), Final_Medal = c(NA, NA, NA,
NA, NA, "GOLD", NA, NA, NA, NA), Success = c(0, 0, 0, 0, 0, 1,
0, 0, 0, 0)), row.names = c(NA, 10L), class = "data.frame")
And it looks like this:
country Final_Medal Success
SGP NA 0
JPN NA 0
THA NA 0
Final_Medal contains NA, GOLD, SILVER, and BRONZE.
Success contains 0 and 1.
All I want for the output is to group by country and count the number of medals and successes for each country.
Desired output:
Country GOLD Silver Bronze Success Total_Entry
SGP 5 2 10 17 50
JPN 4 3 5 12 60
CHN 5 2 6 13 60
Success should count only the 1s, while Total_Entry should include every row, regardless of whether Success is 0 or 1.
I have code that looks like this, but it doesn't work and I am not sure what needs to be done.
library(gtsummary)
example %>%
  tbl_summary(
    by = country,
    missing = "no" # don't list missing data separately
  ) %>%
  bold_labels()
You may do the aggregation in dplyr and use gt/gtsummary for display purposes.
library(dplyr)
library(gt)
df %>%
group_by(country) %>%
summarise(Gold = sum(Final_Medal == 'GOLD', na.rm = TRUE),
Silver = sum(Final_Medal == 'SILVER', na.rm = TRUE),
Bronze = sum(Final_Medal == 'BRONZE', na.rm = TRUE),
Success = sum(Success),
Total_Entry = n()) %>%
gt()
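For reference, here is the aggregated tibble computed from the 10-row sample above, before piping into gt() (output derived by applying the summarise step to the sample data):

```
library(dplyr)

df %>%
  group_by(country) %>%
  summarise(Gold = sum(Final_Medal == 'GOLD', na.rm = TRUE),
            Silver = sum(Final_Medal == 'SILVER', na.rm = TRUE),
            Bronze = sum(Final_Medal == 'BRONZE', na.rm = TRUE),
            Success = sum(Success),
            Total_Entry = n())
# # A tibble: 4 x 6
#   country  Gold Silver Bronze Success Total_Entry
#   <chr>   <int>  <int>  <int>   <dbl>       <int>
# 1 CHN         1      0      0       1           3
# 2 JPN         0      0      0       0           5
# 3 SGP         0      0      0       0           1
# 4 THA         0      0      0       0           1
```

gt() then renders this tibble as a display table without changing the numbers.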
I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches of NA. I need to establish up front for which timespan non-NA data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values for which a subject's column is not NA, and to select the first and last such timestamp, subject by subject. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080" "00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[, 2:4], function(x) {
  ts <- df$timestamp[!is.na(x)]
  paste(ts[1], ts[length(ts)], sep = " - ")
})
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
A tidyverse solution:
library(tidyverse)
df %>%
pivot_longer(-c(timestamp, starttime_ms)) %>%
group_by(name) %>%
drop_na() %>%
  summarise(min = min(timestamp),
            max = max(timestamp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could take the cumulative sum of the differences of the non-NA indicator for each column, coerce the result to logical to mark each subject's stretch of data, and subset the first and last timestamps.
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to create a variable showing each worker's annual wage growth rate (or lack thereof).
The practical issue that I am facing is that each observation is in one row, and while the first worker joined the program in 1990, others might have joined in, say, 1993 or 1992. Therefore, is there a way to compute the growth rate for each worker based on the specific years they worked, rather than applying a general growth formula to all observations?
My expected output would be a new column holding each row's average wage growth rate, e.g.:
row  average wage growth rate
1    15%
2    9%
3    12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.
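(A note on where such Inf values typically come from, without having seen the full data: if a worker's recorded wage is 0 in some year, the next year's ratio divides by zero, and the resulting Inf propagates into the mean and sd. This is easy to reproduce:)

```
x1 <- c(0, 5000)                # a zero wage followed by a positive wage
x1[-1] / x1[-length(x1)] - 1    # growth rate divides by the zero wage
# [1] Inf
```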
Here is one approach:
library(tidyverse)
growth <- df %>%
rowid_to_column() %>%
gather(key, value, -rowid) %>%
drop_na() %>%
arrange(rowid, key) %>%
group_by(rowid) %>%
  mutate(yoy = value / lag(value) - 1) %>%
  summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here the dataframe:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
where you see that, e.g., for the first row there was no growth nor any decline. In the second row there was a slight increase between the second and third year, but none between the first and second. In the third row, again, absolutely no change. Etc...
Also, finally, to add these results to the initial dataframe, you would do e.g.
df %>%
rowid_to_column() %>%
left_join(growth)
And just to answer the performance question, here a benchmark (where I changed akrun's data.frame call to a tibble call to make sure there is no difference coming from this). All functions below correspond to creating the growth rates, not merging back to the original dataframe.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
base R is the clear winner in terms of performance.
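For reproducibility, the benchmark wrappers were not shown above; they would look roughly like this (a sketch: cj, akrun, and akrun2 simply wrap the tidyverse, apply, and tapply solutions from this page):

```
library(dplyr)
library(tidyr)
library(tibble)

# tidyverse solution
cj <- function() {
  df %>%
    rowid_to_column() %>%
    gather(key, value, -rowid) %>%
    drop_na() %>%
    arrange(rowid, key) %>%
    group_by(rowid) %>%
    mutate(yoy = value / lag(value) - 1) %>%
    summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
}

# base R apply solution
akrun <- function() {
  apply(df, 1, function(x) {
    x1 <- x[!is.na(x)]
    mean(x1[-1] / x1[-length(x1)] - 1)
  })
}

# tapply/stack solution
akrun2 <- function() {
  na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
    mean(tail(na.omit(x), -1) / head(na.omit(x), -1) - 1))))[2:1]
}
```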
We can use base R with apply: loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of each element's ratio to the previous one, minus 1.
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
    mean(tail(na.omit(x), -1)/head(na.omit(x), -1) - 1))))[2:1]
Here is my data frame:
structure(list(Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA,
266.7), L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA), C = c(NA,
23.8, NA, 8.47, 20.89, 18.72, 14.94, NA), X = c(40.56, NA, 26.05,
3.08, 23.77, 59.37, NA, NA), W = c(29.47, NA, NA, NA, 36.08,
NA, 27.34, 28.19), S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34,
NA), Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19), H = c(0,
NA, NA, NA, NA, 0, NA, 0)), class = "data.frame", row.names = c(NA,
-8L), .Names = c("Q", "L", "C", "X", "W", "S", "Y", "H"))
Each row has 4 variables that are NA; now I want to do the same operations to every row:
Drop the 4 variables that are NA
Calculate diversity for the remaining 4 variables (it's just some computation involving them; here I use diversity() from vegan)
Append the output to a new data frame
But the problem is:
How do I drop the NA variables using dplyr? I don't know whether select() can do it.
How do I apply operations to every row of a data frame?
It seems that drop_na() removes the entire row for my dataset; any suggestions?
With tidyverse it may be better to gather into 'long' format and then spread it back. Assuming that we have exactly 4 non-NA elements per row: create a row index with rownames_to_column (from tibble), gather (from tidyr) into 'long' format, remove the NA elements, group by row number ('rn'), replace the 'key' values with a common set of names, and then spread back to 'wide' format.
library(tibble)
library(tidyr)
library(dplyr)
res <- rownames_to_column(df1, 'rn') %>%
gather(key, val, -rn) %>%
filter(!is.na(val)) %>%
group_by(rn) %>%
mutate(key = LETTERS[1:4]) %>%
spread(key, val) %>%
ungroup %>%
select(-rn)
res
# A tibble: 8 x 4
# A B C D
#* <dbl> <dbl> <dbl> <dbl>
#1 18.9 40.6 29.5 0
#2 347 23.8 7.47 2.81
#3 167 15.7 26.0 0
#4 163 39.5 8.47 3.08
#5 20.9 23.8 36.1 18.6
#6 18.7 59.4 21.2 0
#7 14.9 27.3 25.3 10.8
#8 267 28.2 12.2 0
diversity(res)
# 1 2 3 4 5 6 7 8
#1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572
Regarding the diversity calculation, we can instead replace NA with 0 and apply diversity() to the whole dataset, i.e.
library(vegan)
diversity(replace(df1, is.na(df1), 0))
#[1] 1.0533711 0.3718959 0.6331070 0.7090783
#[5] 1.3517680 0.9516232 1.3215712 0.4697572
as we get the same output as in the first solution
So I have the following data set (this is a small sample/example of what it looks like, with the original being 7k rows and 30 columns over 7 decades):
Year,Location,Population Total, Median Age, Household Total
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA
I want to create a summary table that indicates how many years of data are available by location for each variable. So something like this would work (for the small example from before):
Location,Population Total, Median Age, Household Total
Adak,1,2,2
Akiachak,2,0,2
Rainfall,1,0,1
Tioga,1,0,0
I'm new to R and haven't used these two commands together, so I'm unsure of the syntax. Any help or alternatives would be wonderful.
A solution with summarize_all from dplyr:
library(dplyr)
df %>%
group_by(Location) %>%
summarize_all(~ sum(!is.na(.))) %>%
select(-Year)
Or you can use summarize_at:
df %>%
group_by(Location) %>%
summarize_at(vars(-Year), ~ sum(!is.na(.)))
Result:
# A tibble: 4 x 4
Location PopulationTotal MedianAge HouseholdTotal
<chr> <int> <int> <int>
1 Adak 1 2 2
2 Akiachak 2 0 2
3 Rainfall 1 0 1
4 Tioga 1 0 0
Data:
df = read.table(text = "Year,Location,PopulationTotal, MedianAge, HouseholdTotal
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA", header = TRUE, sep = ",", stringsAsFactors = FALSE)
library(dplyr)
df = df %>%
mutate_at(vars(PopulationTotal:HouseholdTotal), as.numeric)
You can do something like this:
df %>%
  group_by(Location) %>%
  summarise(count_years = n(),
            count_pop_total = sum(!is.na(PopulationTotal)),
            count_median_age = sum(!is.na(MedianAge)),
            count_house_total = sum(!is.na(HouseholdTotal)))
where you can replace these summary expressions with whatever operation you want to perform. You should take a look at the dplyr vignettes for more general solutions.
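For completeness, the same counts can be computed in base R with aggregate (a sketch using the column names from the data block above; na.action = na.pass keeps the NA rows so they can be counted rather than dropped up front):

```
aggregate(cbind(PopulationTotal, MedianAge, HouseholdTotal) ~ Location,
          data = df,
          FUN = function(x) sum(!is.na(x)),
          na.action = na.pass)
```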