Related
I have a problem with the humans here; they're giving me Citizen Science data in spreadsheets formatted to be attractive and legible. I figured out the right sequence of pivots _longer and _wider to get it into an analyzable format but first I had to do a whole bunch of hand edits to make the column labels usable. I've just been given a corrected spreadsheet so now I have to do the same hand edits all over. Can I avoid this?
reprex <- read_csv("reprex.csv", col_names = FALSE)
gives:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA 2014 NA NA 2015 NA NA 2016 NA
2 NA Total F M Total F M Total F M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB NA NA NA 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
I want column labels like "2014 Total", "2014 F", ... like so:
Location `2014 Total` `2014 F` `2014 M` `2015 Total` `2015 F` `2015 M` `2016 Total` `2016 F` `2016 M`
1 SiteA 180 92 88 134 40 94 34 20 14
2 SiteB NA NA NA 247 143 104 8 8 0
3 SiteC 237 194 43 220 95 125 62 45 17
...which would allow me to twist it up until I get to something like:
Location date Total F M
1 SiteA 2014 180 92 88
2 SiteB 2014 NA NA NA
3 SiteC 2014 237 194 43
4 SiteA 2015 134 40 94
5 SiteB 2015 247 143 104
6 SiteC 2015 220 95 125
7 SiteA 2016 34 20 14
8 SiteB 2016 8 8 0
9 SiteC 2016 62 45 17
The part from the second table to the third I've got; the problem is in how to get from the first table to the second. It would seem like you could pivot the first and then fill in the missing dates with fill(.direction="updown") except that the dates are the grouping value you need to be following.
For this example we could do like this:
library(tidyverse)
df_helper <- df %>%
slice(1:2) %>%
pivot_longer(cols= everything()) %>%
fill(value, .direction = "up") %>%
mutate(x = lead(value, 11)) %>%
drop_na() %>%
unite("name", c(value, x), sep = " ", remove = FALSE) %>%
pivot_wider(names_from = name)
df %>%
setNames(names(df_helper)) %>%
rename(Location = x) %>%
slice(-c(1:2))
Location 2014 Total 2014 F 2014 M 2015 Total 2015 F 2015 M 2016 Total 2016 F 2016 M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB <NA> <NA> <NA> 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
I want to import data into R but I am getting a few errors. I download my ".CSV" file to my computer and specify the file path like this setwd("C:/Users/intellipaat/Desktop/BLOG/files") and then I am writing read.data <- read.csv("file1.csv"), but the console returns an error like this.
"read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do for this? I tried the internet link route, but again I encountered a problem.
I wrote like this:
install.packages("XML")
install.packages("RCurl")
to load the packages, run the following command:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console wrote me this error;
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you help me in this regard...
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
# convert years 2017 - 2020 to character because pivot_longer()
# requires all columns to be of same data type
mutate_at(3:6,as.character) %>%
pivot_longer(-c(Classification,Jurisdiction),
names_to="Year",values_to="Rank") %>%
# convert Rank and Year to numeric values (introducing NA values)
mutate_at(c("Rank","Year"),as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and learned what you were doing except pen_rate. It seems that pen_rate is calculated dividing cum_chart by total. I download the original data set and did the following. I called the data set foo. Whenclosed_pub), I combined Charter and ClosedDate. I checked if ClosedDate is NA or not, and converted the logical output to numbers (1 = open, 0 = closed). This is how I created the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub). I guess this would ask you to do less typing. Since the dates are in character, I extracted year using substr(). If you have a date object, you need to do something else. Once you have year, you group the data with it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then, Convert the output to a wide-format data with spread() and did the rest of the calculation as you demonstrated in your codes. The final output seems different from what you have in your question, but my outcome was identical to one that I obtained by running your codes. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, star = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
I have a dataframe that looks more or less like follows (the original one has 12 years of data):
Year Quarter Age_1 Age_2 Age_3 Age_4
2005 1 158 120 665 32
2005 2 257 145 121 14
2005 3 68 69 336 65
2005 4 112 458 370 101
2006 1 75 457 741 26
2006 2 365 134 223 45
2006 3 257 121 654 341
2006 4 175 124 454 12
2007 1 697 554 217 47
2007 2 954 987 118 54
2007 4 498 235 112 65
Where the numbers in the age columns represents the amount of individuals in each age class for a specific quarter within a specific year. It is noteworthy that sometimes not all quarters in a specific year have data (e.g., third quarter is not represented in 2007). Also, each row represents a sampling event. Although not shown in this example, in the original dataset I always have more than one sampling event for a specific quarter within a specific year. For example, for the first quarter in 2005 I have 47 sampling events, leading therefore to 47 rows.
What I´d like to have now is a dataframe structured in a way like:
Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort
2005 1 158 120 665 32 158
2005 2 257 145 121 14 257
2005 3 68 69 336 65 68
2005 4 112 458 370 101 112
2006 1 75 457 741 26 457
2006 2 365 134 223 45 134
2006 3 257 121 654 341 121
2006 4 175 124 454 12 124
2007 1 697 554 217 47 47
2007 2 954 987 118 54 54
2007 4 498 235 112 65 65
In this case, I want to create a new column (Cohort) in my original dataset which basically follows my cohorts along my dataset. In other words, when I´m in my first year of data (2005 with all quarters), I take the row values of Age_1 and paste it into the new column. When I move to the next year (2006), then I take all my row values related to my Age_2 and paste it to the new column, and so on and so forth.
I have tried to use the following function, but somehow it only works for the first couple of years:
extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {
ny <- 1:nlevels(d$Year) #no. of Year levels in the dataset
nq <- 1:nlevels(d$Quarter)
age0 <- (paste("age", ny, sep="_"))
year0 <- as.character(yearclass + ny - 1)
quarter <- as.character(rep(1:4, length(age0)))
age <- rep(age0,each=4)
year <- rep(year0,each=4)
df <- data.frame(year,age,quarter,stringsAsFactors=FALSE)
n <- nrow(df)
dnew <- NULL
for(i in 1:n) {
tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
tmp$Cohort <- tmp[[age[i]]]
dnew <- rbind(dnew, tmp)
}
levels(dnew$Year) <- paste("Yearclass_", yearclass, ":",
year,":",quarter,":", age, sep="")
dnew
}
I have plenty of data from age_1 to age_12 for all the years and quarters, so I don´t think that it´s something related to the data structure itself.
Is there an easier solution to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.
-M
I have a simple solution but that demands bit of knowledge of the data.table library. I think you can easily adapt it to your further needs.
Here is the data:
DT <- as.data.table(list(Year = c(2005, 2005, 2005, 2005, 2006, 2006 ,2006 ,2006, 2007, 2007, 2007),
Quarter= c(1, 2, 3, 4 ,1 ,2 ,3 ,4 ,1 ,2 ,4),
Age_1 = c(158, 257, 68, 112 ,75, 365, 257, 175, 697 ,954, 498),
Age_2= c(120 ,145 ,69 ,458 ,457, 134 ,121 ,124 ,554 ,987, 235),
Age_3= c(665 ,121 ,336 ,370 ,741 ,223 ,654 ,454,217,118,112),
Age_4= c(32,14,65,101,26,45,341,12,47,54,65)
))
Here is th code :
DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]
and the output:
> DT
Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
1: 2005 1 158 120 665 32 1 158
2: 2005 2 257 145 121 14 1 257
3: 2005 3 68 69 336 65 1 68
4: 2005 4 112 458 370 101 1 112
5: 2006 1 75 457 741 26 2 457
6: 2006 2 365 134 223 45 2 134
7: 2006 3 257 121 654 341 2 121
8: 2006 4 175 124 454 12 2 124
9: 2007 1 697 554 217 47 3 217
10: 2007 2 954 987 118 54 3 118
11: 2007 4 498 235 112 65 3 112
What it does:
DT[,index := .GRP, by = Year]
creates an index for all different year in your table (by = Year makes an operation for group of year, .GRP create an index following the grouping sequence).
I use it to call the column that you named Age_ with the number created
DT[,cohort := get(paste0("Age_",index)),by = Year]
You can even do everything in the single line
DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
I hope it helps
Here is an option using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
gather(key, Cohort, -Year, -Quarter) %>%
separate(key, into = c('key1', 'key2')) %>%
mutate(ind = match(Year, unique(Year))) %>%
group_by(Year) %>%
filter(key2 == Quarter[ind]) %>%
mutate(newcol = paste(Year, Quarter, paste(key1, ind, sep="_"), sep=":")) %>%
ungroup %>%
select(Cohort, newcol) %>%
bind_cols(df1, .)
# Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort newcol
#1 2005 1 158 120 665 32 158 2005:1:Age_1
#2 2005 2 257 145 121 14 257 2005:2:Age_1
#3 2005 3 68 69 336 65 68 2005:3:Age_1
#4 2005 4 112 458 370 101 112 2005:4:Age_1
#5 2006 1 75 457 741 26 457 2006:1:Age_2
#6 2006 2 365 134 223 45 134 2006:2:Age_2
#7 2006 3 257 121 654 341 121 2006:3:Age_2
#8 2006 4 175 124 454 12 124 2006:4:Age_2
#9 2007 1 697 554 217 47 47 2007:1:Age_3
#10 2007 2 954 987 118 54 54 2007:2:Age_3
#11 2007 4 498 235 112 65 65 2007:4:Age_3
So I have a dataset which simplified looks something like this:
Year ID Sum
2009 999 100
2009 123 85
2009 666 100
2009 999 100
2009 123 90
2009 666 85
2010 999 100
2010 123 100
2010 666 95
2010 999 75
2010 123 100
2010 666 85
I'd like to add a column with the cumulative sum, by year and ID. Like this:
Year ID Sum Cum.Sum
2009 999 100 100
2009 123 85 85
2009 666 100 100
2009 999 100 200
2009 123 90 175
2009 666 85 185
2010 999 100 100
2010 123 100 100
2010 666 95 95
2010 999 75 175
2010 123 100 200
2010 666 85 180
I think this should be pretty straight-forward, but somehow I haven't been able to figure it out. How do I do this? Thanks for the help!
Using data.table:
require(data.table)
DT <- data.table(DF)
DT[, Cum.Sum := cumsum(Sum), by=list(Year, ID)]
Year ID Sum Cum.Sum
1: 2009 999 100 100
2: 2009 123 85 85
3: 2009 666 100 100
4: 2009 999 100 200
5: 2009 123 90 175
6: 2009 666 85 185
7: 2010 999 100 100
8: 2010 123 100 100
9: 2010 666 95 95
10: 2010 999 75 175
11: 2010 123 100 200
12: 2010 666 85 180
Another way
1) use ddply to sum a variable by group (similar to SQL group by)
X <- ddply ( dataset, .(Year,ID), sum)
2) merge the result with dataset
Y <- merge( dataset, X, by=('Year','ID')
You can use dplyr, and the base function cumsum:
require(dplyr)
dataset %>%
group_by(Year, ID) %>%
mutate(cumsum = cumsum(Sum)) %>%
ungroup()