Spread across multiple value columns in R

My dataset looks like this:
dataset <- data.frame(Site = c(rep('A', 6), rep('B', 6)),
                      Date = rep(c('2019-05-31', '2019-04-30', '2019-03-31'), 4),
                      Question = c(rep('Q1', 3), rep('Q2', 3)),
                      Score = runif(12, 0.5, 1),
                      Average = runif(12, 0.5, 1))
I'd like to spread the columns in such a way that the first two columns contain Site and Question, and the remaining columns are the Score_Date and Average_Date combinations.
Here's an example of what the first line of the resulting table would look like:
Site Question Score_2019.03.31 Score_2019.04.30 Score_2019.05.31 Average_2019.03.31 Average_2019.04.30 Average_2019.05.31
A Q1 0.9117566 0.8661078 0.5624139 0.7246694 0.8870703 0.6401099
I tried using unite and spread from tidyr but got nowhere close to the result.
Any inputs would be highly appreciated.

Using tidyr and dplyr from the tidyverse, you could do the following:
library(tidyverse)
dataset %>%
  nest(Score, Average, .key = 'value_col') %>%
  spread(key = Date, value = value_col) %>%
  unnest(`2019-03-31`, `2019-04-30`, `2019-05-31`, .sep = "_")
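As a side note, the nest()/spread()/unnest() calls above use the pre-1.0 tidyr interface. In tidyr 1.0 and later, pivot_wider() (which supersedes spread()) handles multiple value columns in one step; a minimal sketch, assuming the tidyverse is loaded as above:
dataset %>%
  pivot_wider(names_from = Date,
              values_from = c(Score, Average),
              names_sep = "_")
This should yield columns like Score_2019-05-31 and Average_2019-05-31 directly (with dashes rather than dots in the dates).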

Related

Lead and lag issue using dplyr

I have a data frame with 365 rows reflecting the calendar year, with one column per county, that looks like this. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across():
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, ~ lag(.x)))
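Note that lag() shifts values down, leaving an NA in the first row. If "shift up by one row" means each date should take the next row's value, lead() performs the opposite shift; a sketch under that assumption:
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, lead))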

Transpose my R Dataset for association analysis

I am sort of a newbie with R and data manipulation, and I am trying to transpose the UCI words dataset. The default dataset is currently structured like so.
The first column is the document number, the second column is the word number referencing another text file, and the last column is the number of times the word occurs in the document. (For now, we can forget about the third column; I know how to drop it from the dataset.)
What I am trying to do is to transpose the dataset so that I can have each document's words in one row. So a simple example would be like this.
I tried using the t() function, but it transposes the entire dataset all together, which is not what I want. I looked into using the dplyr package to help with the data manipulation, but I am not getting any solid leads. If you have any sources or a particular direction you can nudge me towards to accomplish this, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries with the same value in the first column (document number) and then just change the data to wide format using pivot_wider.
library(tidyverse)
# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")
df %>%
  # Group by document number
  group_by(num) %>%
  # Add a row number to differentiate entries for the same first column value
  mutate(rownum = row_number()) %>%
  # Change data to wide format (note the argument is id_cols, not id)
  pivot_wider(id_cols = num,
              names_from = rownum,
              values_from = word)
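For reference, with the sample data above this should produce one row per document number:
# num    1     2
#   1   61    89
#   2   76    NA
#   3  211   296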
So I was able to figure out how to accomplish this task. Hopefully, it helps other DS's in the future.
library(dplyr)  # for select(), group_by(), summarise()

data <- read.table("docword.kos.txt", sep = " ")
data <- data %>% select(V1, V2)

trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))

trans <- trans %>% select(words)
What I ended up doing is using the dplyr package to perform some data wrangling and group my dataset by the first column. Then I exported and re-uploaded the dataset after making some slight adjustments in Notepad (removing the quotation marks from the generated CSV file).
write.csv(trans, "~/trend.csv", row.names = FALSE)
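As an aside, the manual quote-stripping step can likely be avoided by passing quote = FALSE to write.csv():
write.csv(trans, "~/trend.csv", row.names = FALSE, quote = FALSE)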

Better way to apply which.max over dataframe

So I'm trying to learn R while playing with a dataset from https://www.kaggle.com/abcsds/pokemon
data = read.csv("Pokemon.csv")
data$Name = sub(".*(Mega)", "Mega", data$Name) # replacing name duplications
And I want to find all the pokemon that have a maximum value in any of the stat columns (Total, Attack, HP, etc.):
I know I can do this: sapply(data[5:11], max, na.rm = TRUE) to find out the max values and stuff like
data[which.max(data$Total),]
data[which.max(data$HP),]
data[which.max(data$Attack),]
to find all the rows that have a max.
Is there a way I can use something like sapply in order to get all the rows without going through them sequentially?
I believe this is what you want to achieve.
I used the tidyverse for this. Since the data is in wide format, with a different column for each stat, I first convert it into long format using pivot_longer(), then group by the stats column and filter for the max of each group to achieve the desired result.
library(tidyverse)
data %>%
  select(c(2, 5:11)) %>%
  pivot_longer(-1, names_to = "stats") %>%
  group_by(stats) %>%
  filter(value == max(value))
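Since the question asked for an sapply-style approach, here is a minimal base-R sketch along the same lines (assuming, as above, that columns 5:11 hold the stats). Note that which.max() returns only the first row on ties, whereas the filter() version keeps all of them:
# One row per stat column: the pokemon with the (first) maximum value for that stat
max_rows <- lapply(data[5:11], function(col) data[which.max(col), ])
do.call(rbind, max_rows)  # combine into a single data frame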

How to summarize data by several groups (R)

I have a dataframe that looks like this:
And I want to get this: a single row per Group, with a column for the % of As in ID_1_Subgroup for each Group, together with the sum of ValueSubgroup for each Group.
Can someone help? I have seen other questions (like this one: Summarizing by group and subgroup) which are similar but not for R.
We can do
library(dplyr)
df1 %>%
  group_by(Group) %>%
  summarise(PercA = mean(ID_1_Subgroup == "A"),
            SumOfValueSubgroup = sum(ValueSubgroup))
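Since no sample data was posted, here is a hypothetical illustration (df1 and its columns are assumptions based on the question). The mean(ID_1_Subgroup == "A") trick works because the logical vector is coerced to 0/1, so its mean is the proportion of "A" values:
df1 <- data.frame(Group = c("G1", "G1", "G2", "G2"),
                  ID_1_Subgroup = c("A", "B", "A", "A"),
                  ValueSubgroup = c(10, 20, 30, 40))
# PercA is 0.5 for G1 and 1.0 for G2; SumOfValueSubgroup is 30 and 70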

How do I aggregate certain columns from data frame by a Unique ID?

I have a list of statcast data per day, dating back to 2016. I am attempting to aggregate this data to find the mean for each pitcher ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by = list(pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)
This call aggregates every single column. I am looking to aggregate only a certain number of columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add the summarise_at() function like so:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
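In dplyr 1.0 and later, summarise_at() is superseded by across(); an equivalent sketch (with col1:coln standing in for your columns) would be:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(across(col1:coln, ~ mean(.x, na.rm = TRUE)))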
Alternatively, replace the first argument to aggregate() (pitchingstat) with just the column(s) you want to aggregate, e.g. pitchingstat[, c("col1", "col2")].
How about:
library(tidyverse)
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(variable)) # replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
