How to read in data from messy Excel workbooks - r

I've been dealing with patient and financial data from a hospital. The data is stored in .xlsx Excel workbooks. Each sheet contains multiple tables laid out both horizontally and vertically. Some of the columns have neatly defined names, as you would want for R, but others do not, or have stray text in between at what appear to be random positions. At times
a section has a title created by merging multiple cells into one single row.
Unfortunately, I cannot show the data due to confidentiality. Is there any way around this when the data is far from being in a tidy format?
So far I have been copying and pasting the data into a new CSV.
While this was effective, it felt very inefficient. Is this the best approach to take?
Any help would be much appreciated.
Thanks
EDIT
As I cannot show the data, this is the best I can do.
Hi @Paul
So let me give a rough example:
         Jan  Feb  March  April
Income X   1    2      3      4
Income Y   2    4      4      6

Expenditure
         Jan  Feb  March  April    (another table here also)
Expense    1    3      5      7
Expense    5    6      7      8

(Excel bar chart)

Look at the readxl package; the range argument might be what you're looking for:
library(readxl)
df1 <- read_xlsx("C:\\Users\\...\\Desktop\\Book1.xlsx", range = "A1:D3")
# # A tibble: 2 x 4
# Jan Feb March April
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 3 4
# 2 2 4 4 6
df2 <- read_xlsx("C:\\Users\\...\\Desktop\\Book1.xlsx", range = "B6:E8")
# # A tibble: 2 x 4
# Jan Feb March April
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 5 7
# 2 5 6 7 8
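If there are several such blocks, you could keep the cell ranges in a named vector and read them all in one pass. A minimal sketch; the path is the placeholder from above and the ranges are assumptions for this layout:
library(readxl)
path <- "C:\\Users\\...\\Desktop\\Book1.xlsx"
# named vector of the rectangles to extract; adjust the ranges to your sheet
ranges <- c(income = "A1:D3", expenses = "B6:E8")
tables <- lapply(ranges, function(r) read_xlsx(path, range = r))
# tables$income and tables$expenses are now separate tibbles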

Related

How do I make new columns for data across time using the variables from an original column? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 9 months ago.
I have R code that takes raw data where each patient entry is one row and sums it into a 'Frequency' column for each department by date.
The code I used was:
department_totals <- as.data.frame(count(sheet, c("Date", "Department")))
To get:
Department   Date     Frequency
Dental       14 Mar   5
Dental       15 Mar   3
Dental       16 Mar   2
Cardio       14 Mar   4
Cardio       15 Mar   7
Cardio       16 Mar   8
Physio       14 Mar   1
Physio       16 Mar   2
But for this new project, I need each individual department as its own column, by date, like this:
Date     Dental   Cardio   Physio
14 Mar   5        4        1
15 Mar   3        7        (blank)
16 Mar   2        8        2
And I can't figure out how to do it. I can group by department, but what I'm really trying to do is make each unique value in 'Department' its own column, with its frequencies as the values, ordered by date.
The intent here is to be able to make line graphs of how each of these departments' frequency of patients changes over time.
library(tidyverse)
df %>% pivot_wider(names_from = Department, values_from = Frequency)
# A tibble: 3 x 4
Date Dental Cardio Physio
<chr> <int> <int> <int>
1 14_Mar 5 4 1
2 15_Mar 3 7 NA
3 16_Mar 2 8 2
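Side note on the stated goal: for line graphs of each department's patient frequency over time, ggplot2 can plot straight from the long table, so the pivot is optional. A sketch, assuming the "14 Mar" dates are first parsed into real dates (the year 2022 below is an arbitrary assumption for illustration):
library(ggplot2)
# df is the long Date/Department/Frequency table from the question
df$Date <- as.Date(paste(df$Date, "2022"), format = "%d %b %Y")
ggplot(df, aes(x = Date, y = Frequency, colour = Department)) +
  geom_line()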

Is there a way I can get the maximum value for each group after a double group_by in R?

I am trying to extract the team with the most wins in each year of women's college basketball. I currently have the number of wins for each team in each year, but I want only the team with the maximum number of wins per year.
winsbyyear <- WomenCBnewdf %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome))
The output currently looks like this, but I expect to see each year only once, with the team that has the maximum number of wins in the subsequent columns:
Year Team totalwinsyr
<fct> <chr> <dbl>
1 2014 AbileneChristian 10
2 2014 AirForce 0
3 2014 Akron 18
4 2014 Alabama 10
5 2014 AlabamaAM 3
6 2014 AlabamaHuntsville 0
7 2014 AlabamaMobile 0
8 2014 AlabamaSt 15
9 2014 AlaskaAnchorage 1
10 2014 AlbanyNY 16
How to select the rows with maximum values in each group with dplyr?
I have already looked at that question, but I could not find any resources covering a group_by() with multiple variables.
Create the win totals with summarise() and then filter. Note that summarise() drops the last grouping level (Team), so the result is still grouped by Year and the filter runs within each year:
winsbyyear <- WomenCBnewdf %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome)) %>%  # result is grouped by Year only
  filter(totalwinsyr == max(totalwinsyr))
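An alternative sketch, assuming dplyr >= 1.0.0: compute the totals with count() and take the per-year maximum with slice_max(), which also keeps ties:
library(dplyr)
WomenCBnewdf %>%
  count(Year, Team, wt = Outcome, name = "totalwinsyr") %>%  # sum of Outcome per Year/Team
  group_by(Year) %>%
  slice_max(totalwinsyr, n = 1)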

Finding monthly average from weekly data for every company

Hello, I am new to R and I'm trying to find the monthly average of ownership data from weekly data for every company. The data consists of 3 different sheets of weekly data from 2009 to 2020 for many companies, which I merged into one data frame. It looks something like this: "tarih" means "Date", mbr_id identifies the company, and "mulkiyet_bakiye" is the ownership level that I'm trying to find the monthly average of.
> head(df)
# A tibble: 6 x 3
tarih mbr_id mulkiyet_bakiye
<date> <chr> <dbl>
1 2009-01-02 A 1083478.
2 2009-01-02 B 1624843.
3 2009-01-02 C 90340363.
4 2009-01-02 D 2128114.
5 2009-01-02 E 47541783.
6 2009-01-02 F 268874.
I've tried something like this so far (the solution was for another problem, but I thought it might work for this one too):
df$tarih <- as.Date(df$tarih, format = '%Y-%m-%d')
monthly_average <- df %>%
  mutate(year = year(tarih), month = month(tarih), week = week(tarih)) %>%
  unite_("date", c("year", "month", "week"), sep = "-") %>%
  group_by(date, mbr_id) %>%
  summarise(monthly_mean_owner = mean(mulkiyet_bakiye)) %>%
  arrange(mbr_id)
However, the result looks like this:
> head(monthly_average,10)
# A tibble: 10 x 3
# Groups: date [10]
date mbr_id monthly_mean_owner
<chr> <chr> <dbl>
1 2009-1-1 A 1083478.
2 2009-1-2 A 1083478.
3 2009-1-3 A 1083478.
4 2009-1-4 A 1083478.
5 2009-1-5 A 1083588.
6 2009-10-40 A 993589.
7 2009-10-41 A 993589.
8 2009-10-42 A 993589.
9 2009-10-43 A 993589.
10 2009-10-44 A 993589.
I think I've made a mistake while constructing the dates, but I don't know how to fix it.
Could someone help me do that, or suggest another way to do this calculation?
Thanks, I appreciate any response.
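A sketch of one possible fix, not from the original thread: the week component splits every month into several groups, so drop it and group on the calendar month directly, for example with lubridate::floor_date():
library(dplyr)
library(lubridate)
monthly_average <- df %>%
  mutate(month = floor_date(tarih, unit = "month")) %>%  # first day of each month
  group_by(mbr_id, month) %>%
  summarise(monthly_mean_owner = mean(mulkiyet_bakiye)) %>%
  arrange(mbr_id, month)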

Obtaining data from Spotify Top Charts using spotifyr

I'm trying to obtain the audio features for the top 200 charts for all of 2017 using the spotifyr package in R. I tried:
days <- spotifycharts::chartdaily()
for (i in days) {
  spotifycharts::chart_top200_daily(region = "global", days = "days[i]")
}
to obtain the top 200 daily for all of 2017, but I was unable to do it.
Can someone help me? :(
It works if you turn days from a tibble into a vector:
days <- unlist(chart_daily())
lapply(days[1:3], function(i) chart_top200_daily("global", days = i))
But it parses the data badly, so there will be problems with variable names, etc.:
# A tibble: 6 x 5
x1 x2 x3 note.that.these.figures.are.generated.… x5
<int> <chr> <chr> <int> <chr>
1 NA Track Name Artist NA URL
2 1 thank u, next Ariana… 8293841 https://open.spoti…
3 2 Taki Taki (with S… DJ Sna… 5467625 https://open.spoti…
4 3 MIA (feat. Drake) Bad Bu… 3955367 https://open.spoti…
5 4 Happier Marshm… 3357435 https://open.spoti…
6 5 BAD XXXTEN… 3131745 https://open.spoti…
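A possible cleanup sketch, not from the original answer; the column names are assumptions read off the stray header row in the printout above:
# the real header ends up as data row 1, so restore the names and drop it
fix_chart <- function(df) {
  names(df) <- c("Position", "Track Name", "Artist", "Streams", "URL")
  df[-1, ]
}
charts <- lapply(days[1:3], function(i) {
  fix_chart(chart_top200_daily("global", days = i))
})
charts_all <- dplyr::bind_rows(charts)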

New column from non-standard date factor in R

I have a dataframe with an oddly formatted date column. I'd like to create a column showing just the year from the original date column, but I'm having trouble because the date column is being treated as a factor. Any advice on how to do this efficiently would be appreciated.
Example
starting with:
org <- c("a","b","c","d")
country <- c("1","2","3","4")
date <- c("01-09-14","01-10-07","11-31-99","10-31-12")
toy <- data.frame(cbind(org,country,date))
toy
org country date
1 a 1 01-09-14
2 b 2 01-10-07
3 c 3 11-31-99
4 d 4 10-31-12
str(toy$date)
Factor w/ 4 levels "01-09-14","01-10-07",..: 1 2 4 3
Desired result:
org country Year
1 a 1 2014
2 b 2 2007
3 c 3 1999
4 d 4 2012
This should work:
transform(toy, Year = format(strptime(date, "%m-%d-%y"), "%Y"))
This produces
## org country date Year
## 1 a 1 01-09-14 2014
## 2 b 2 01-10-07 2007
## 3 c 3 11-31-99 <NA>
## 4 d 4 10-31-12 2012
I initially thought that the NA value was because the %y format indicator wasn't smart enough to handle previous-century dates, but ?strptime says:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
implying that it should be able to handle it.
The problem is actually that 31 November doesn't exist ...
(You can drop the date column at your leisure ...)
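For completeness, a lubridate equivalent, offered as a sketch rather than part of the original answer; mdy() likewise returns NA for the impossible 11-31-99:
library(lubridate)
toy$Year <- year(mdy(toy$date))  # mdy() coerces the factor to character and parses month-day-year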
