Why does dplyr put quotes around my variables?

I'm using the pivot_wider() function (I also tried spread()) from dplyr/tidyr to rearrange some data. But when I use the function, the output variable names get weird backticks ` ` around them (shown below). I don't see any mention of this in the documentation and I haven't found anything on it on StackExchange. I've tried changing the variables to num and int before running the function, but that doesn't seem to have any effect on the backticks.
The problem is that if I want to do any operations on the new variables, I now need to write them as `2014`, which gets old fast. Am I doing something wrong, or is there something I need to do to my data before/after running this?
Original data
ID group var
<dbl> <chr> <dbl>
1 4548 2014 18
2 4549 2014 19
3 4550 2015 20
pivot_wider(names_from = group, values_from = var)
Output data
ID `2014` `2015`
<dbl> <dbl> <dbl>
1 4548 18 NA
2 4549 19 NA
3 4550 NA 20
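The backticks are just how tibbles print non-syntactic names (names that start with a digit, like 2014); nothing is wrong with the data. One way around them, sketched below, is to make the new column names syntactic with pivot_wider()'s names_prefix argument (the "yr_" prefix is an arbitrary choice):

```r
library(tidyr)

# The example data from the question
df <- tibble::tibble(
  ID    = c(4548, 4549, 4550),
  group = c("2014", "2014", "2015"),
  var   = c(18, 19, 20)
)

# names_prefix turns "2014" into "yr_2014", a syntactic name,
# so the new columns no longer need backticks downstream.
wide <- pivot_wider(df, names_from = group, values_from = var,
                    names_prefix = "yr_")
names(wide)
# "ID" "yr_2014" "yr_2015"
```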

Related

Strsplit function in R

I'm currently working through some coursework where data has been provided on supermarket chip sales. Part of the task is to remove any entries where products are not chips, and code has been provided to help with this:
productWords <- data.table(unlist(strsplit(unique(transaction_data[, "PROD_NAME"]), "")))
Here, transaction_data is the data file provided, and PROD_NAME is the column we're interested in.
This however, returns the error:
Error in strsplit(unique(transaction_data[, "PROD_NAME"]), "") : non-character argument
Can someone please explain what I'm doing wrong, or am I missing something? I'm not sure how this code could tell one product apart from another. Am I meant to add something to the code based on product names I've seen while looking through the data?
Here are some lines of the data as an example:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
<date> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 2018-10-17 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6
2 2019-05-14 1 1307 348 66 CCs Nacho Cheese 175g 3 6.3
3 2019-05-20 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
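A likely cause, sketched below with toy data standing in for the provided file: on a data.table, transaction_data[, "PROD_NAME"] returns a one-column data.table rather than a character vector, which is the "non-character argument" strsplit() complains about. Extracting the column with $ avoids the error, and splitting on a space " " (rather than the empty string "") gives whole words:

```r
library(data.table)

# Toy stand-in for the provided transaction data
transaction_data <- data.table(
  PROD_NAME = c("Natural Chip Compny SeaSalt175g",
                "CCs Nacho Cheese 175g",
                "Smiths Crinkle Cut Chips Chicken 170g")
)

# $ extracts PROD_NAME as a character vector, so strsplit() works;
# splitting on " " yields product-name words, not single characters
productWords <- data.table(
  words = unlist(strsplit(unique(transaction_data$PROD_NAME), " "))
)
```

The word list itself still doesn't know which products are chips; the usual next step in this coursework is to inspect productWords for salsa/non-chip terms and filter on those.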

Convert Tables to Data Frames with Loop in R

I have over 100 tables. Each has multiple columns (ID, Date, Days, Mass, Float, Date 2, Days 2, pH).
I split the IDs from the data frame and made them the names of the tables as shown below.
data = NN
ID <- paste0("", NN$ID)
SD<- split(NN,ID)
SD
Each of the IDs looks as follows:
> SD$`4469912`
# A tibble: 5 × 8
ID Date Days Mass Float `Date 2` `Days 2` pH
<dbl> <dttm> <dbl> <dbl> <dbl> <dttm> <dbl> <chr>
1 4469912 2022-05-24 00:00:00 0 440 16.9 NA 0 NA
2 4469912 2022-05-27 00:00:00 3 813 NA NA 0 NA
3 4469912 2022-06-02 00:00:00 9 930 NA NA 0 NA
4 4469912 2022-06-03 00:00:00 10 914. NA NA 0 NA
5 4469912 2022-06-06 00:00:00 13 944 NA NA 0 NA
I would like to convert each ID to its own data frame, as shown below
`4469912`<- data.frame(SD$`4469912`)
AKA
`4469912`<- data.frame(SD[9])
The problem I am running into is writing a loop to create each table as its own data frame. I would like to name the data frames after their corresponding ID, something along the lines of the code below.
for (x in SD) {
names(SD[x]) <- data.frame(SD[x])
}
EDIT: I will add that the end goal is to pull or select specific IDs and plot them against one another in ggplot, with each ID as its own geom_line, for example:
`4469912`<- data.frame(SD$`4469912`)
`4469822`<- data.frame(SD$`4469822`)
`4469222`<- data.frame(SD$`4469222`)
ggplot(data=NULL,aes(x=`Date`,y=`Mass`)) +
geom_line(data = `4469912`,aes(col="red"))+
geom_line(data = `4469822`,aes(col="blue"))+
geom_line(data = `4469222`,aes(col="green"))
Rather than plotting the entirety of my original data frame, I can then look at falloff or regression between specific IDs instead of across all of the data points, if that makes sense and/or is relevant.
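A sketch of one way to do this, assuming SD is the named list produced by split() above: list2env() assigns every element of a named list into an environment in one call, so each ID becomes its own data frame without a loop. The toy values below are hypothetical stand-ins for NN:

```r
# Toy stand-in for NN (hypothetical values)
NN <- data.frame(
  ID   = c(4469912, 4469912, 4469822),
  Days = c(0, 3, 0),
  Mass = c(440, 813, 500)
)
SD <- split(NN, NN$ID)

# One assignment per list element, each named after its ID.
# The resulting objects still need backticks to access: `4469912`
list2env(SD, envir = globalenv())
nrow(`4469912`)
# 2
```

For the ggplot use-case, though, it is often easier to skip the per-ID objects entirely: filter the original data to the IDs of interest and map colour = factor(ID), so ggplot draws one line per ID automatically.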

Obtaining data from Spotify Top Charts using spotifyr

I'm trying to obtain the audio features for the top 200 charts of all of 2017 using the spotifyr package on R, I tried:
days<- spotifycharts::chartdaily()
for (i in days) {
spotifycharts::chart_top200_daily(region = "global",days = "days[i]")
}
to obtain the top 200 daily for all of 2017, but I was unable to do it.
Can someone help me? :(
It works if you turn days from a tibble into a vector:
days <- unlist(chart_daily())
lapply(days[1:3], function(i) chart_top200_daily("global", days = i))
But it parses the data badly, so there will be problems with variable names, etc.:
# A tibble: 6 x 5
x1 x2 x3 note.that.these.figures.are.generated.… x5
<int> <chr> <chr> <int> <chr>
1 NA Track Name Artist NA URL
2 1 thank u, next Ariana… 8293841 https://open.spoti…
3 2 Taki Taki (with S… DJ Sna… 5467625 https://open.spoti…
4 3 MIA (feat. Drake) Bad Bu… 3955367 https://open.spoti…
5 4 Happier Marshm… 3357435 https://open.spoti…
6 5 BAD XXXTEN… 3131745 https://open.spoti…

How to read in Data from Messy Excel Books

I've been dealing with patient and financial data from a hospital. The data is stored in .xlsx Excel workbooks. There are multiple tables within each sheet, stretching horizontally and vertically. Some of the columns have neatly defined names, as you would want for R, but others do not, or have text scattered in between, seemingly at random. At times a section has a title which is the result of multiple rows being merged into a single row.
Unfortunately, I cannot show the data due to confidentiality. Is there any way around this when the data is far from being in a tidy format?
So far I have been copying and pasting the data into a new CSV. While this was effective, I felt it was largely inefficient. Is this the best approach to take?
Help would be much appreciated
Thanks
EDIT
As I cannot show data this is the best I can show
Hi @Paul
So Let me give a rough example
Jan Feb March April
Income X 1 2 3 4
Income Y 2 4 4 6
Expenditure
Jan Feb March April Another table here also
Expense 1 3 5 7
Expense 5 6 7 8
(Excel Bar chart)
Look at the readxl package; the range option might be what you're looking for:
library(readxl)
df1 <- read_xlsx("C:\\Users\\...\\Desktop\\Book1.xlsx", range = "A1:D3")
# # A tibble: 2 x 4
# Jan Feb March April
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 5 7
# 2 5 6 7 8
df2 <- read_xlsx("C:\\Users\\...\\Desktop\\Book1.xlsx", range = "B6:E8")
# # A tibble: 2 x 4
# Jan Feb March April
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 5 7
# 2 5 6 7 8
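If the same layout repeats across sheets (for example one tab per month), the range argument combines with readxl's excel_sheets() to pull the same rectangular block from every sheet. A sketch, where the path and range are placeholders for your own workbook and layout:

```r
library(readxl)

# Read the same rectangular block from every sheet of a workbook.
# `path` and `range` are placeholders for your own file and layout.
read_all_sheets <- function(path, range) {
  sheets <- excel_sheets(path)
  setNames(
    lapply(sheets, function(s) read_xlsx(path, sheet = s, range = range)),
    sheets
  )
}

# tables <- read_all_sheets("Book1.xlsx", "A1:D3")
```

This returns a named list of tibbles, one per sheet, which you can then bind or inspect individually.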

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60
For some reason I need to put this in a cross-sectional structure, such that I get averages over all years for each country. In the end, it should look like this:
Country Outcome Country-Characteristic
A 12 40
B 11 60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean), but that did not work as I wanted.
Two tips: (1) when you ask a question, you should provide a reproducible example for the data (as I did with read.table below); (2) it's not a good idea to use "-" in column names; use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(Outcome = mean(Outcome),
            Countrycharacteristic = mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate
aggregate(.~Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60
