Reshape only 2 columns while keeping multiple columns the same - R

My sample data looks like:
time state district count category
2018-01-01 Telangana Nalgonda 17 Water
2018-01-01 Telangana Nalgonda 8 Irrigation
2018-01-01 Telangana Nalgonda 52 Seeds
2018-01-01 Telangana Nalgonda 28 Electricity
2018-01-01 Telangana Nalgonda 27 Storage
2018-01-01 Telangana Nalgonda 12 Pesticides
I have around two years of monthly data for different states and districts.
I would like to reshape the data to wide format.
What I tried:
one <- reshape(dataset,idvar = c("time","state","district"),v.names = names(dataset$category),
timevar = "count"
, direction = "wide")
Expected Output :
time state district Water Irrigation Seeds Electricity Storage Pesticides
2018-01-01 Telangana Nalgonda 17 8 52 28 27 12
I'm not sure how exactly the reshape function works. I've seen many examples but couldn't find clear explanations.
Can someone tell me what I'm doing wrong?

We could use gather and spread:
library(dplyr)
library(tidyr)
df1 %>%
  gather(key, value, count) %>%
  spread(category, value) %>%
  select(-key)
# time state district Electricity Irrigation Pesticides Seeds Storage Water
#1 2018-01-01 Telangana Nalgonda 28 8 12 52 27 17
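As a side note, gather and spread have since been superseded in tidyr; with tidyr >= 1.0.0 the same reshape is a one-liner with pivot_wider (a sketch, using the df1 reprex from the data section below):

```r
library(tidyr)
# values of `category` become column names; `count` fills them
pivot_wider(df1, names_from = category, values_from = count)
```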

We can use data.table
library(data.table)
dcast(setDT(df1), time + state + district + rowid(count) ~ category,
      value.var = 'count')
# time state district count Electricity Irrigation Pesticides Seeds Storage Water
#1: 2018-01-01 Telangana Nalgonda 1 28 8 12 52 27 17
Data
df1 <- structure(list(time = c("2018-01-01", "2018-01-01", "2018-01-01",
"2018-01-01", "2018-01-01", "2018-01-01"), state = c("Telangana",
"Telangana", "Telangana", "Telangana", "Telangana", "Telangana"
), district = c("Nalgonda", "Nalgonda", "Nalgonda", "Nalgonda",
"Nalgonda", "Nalgonda"), count = c(17L, 8L, 52L, 28L, 27L, 12L
), category = c("Water", "Irrigation", "Seeds", "Electricity",
"Storage", "Pesticides")), class = "data.frame", row.names = c(NA,
-6L))
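For completeness, the original base-R reshape call can also be repaired. The mistakes were that timevar should be the column whose values become the new column names (category, not count), and v.names should be the name of the measured column as a string (a sketch):

```r
# base R reshape: category values become columns, count fills them
wide <- reshape(df1,
                idvar   = c("time", "state", "district"),
                timevar = "category",
                v.names = "count",
                direction = "wide")
# the new columns are named count.Water, count.Irrigation, ...;
# strip the prefix if plain category names are preferred
names(wide) <- sub("^count\\.", "", names(wide))
```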

Related

How can I merge different data sets in R knowing that the variable that I use for matching the two data set are not unique?

I have two datasets that I need to merge by the ID value. The problems are:
The ID value can be repeated within the same dataset (no other unique identifier is available).
The two datasets do not have the same number of rows or columns.
Example:
df1
ID  Gender
99  Male
85  Female
7   Male
df2
ID  Body_Temperature  Body_Temperature_date_time
99  36                1/1/2020 12:00 am
99  38                2/1/2020 10:30 am
99  37                1/1/2020 06:41 am
52  38                1/2/2020 11:00 am
11  39                4/5/2020 09:09 pm
7   35                9/8/2020 02:30 am
How can I turn these two datasets into one single dataset in a way that allows me to apply some machine learning models on it later on?
Depending on your expected results, if you want to return all rows from each dataframe, you can use a full_join from dplyr:
library(dplyr)
full_join(df2, df1, by = "ID")
Or with base R:
merge(x = df2, y = df1, by = "ID", all = TRUE)
Output
ID Body_Temperature Body_Temperature_date_time Gender
1 99 36 1/1/2020 12:00 am Male
2 99 38 2/1/2020 10:30 am Male
3 99 37 1/1/2020 06:41 am Male
4 52 38 1/2/2020 11:00 am <NA>
5 11 39 4/5/2020 09:09 pm <NA>
6 7 35 9/8/2020 02:30 am Male
7 85 NA <NA> Female
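Note that because ID 99 repeats in df2, this is a one-to-many join. As a side note, in recent dplyr (>= 1.1.0) you can state the expected cardinality explicitly so that surprises raise an error (a sketch; the relationship argument does not exist in older dplyr versions):

```r
library(dplyr)
# each df2 row should match at most one df1 row
full_join(df2, df1, by = "ID", relationship = "many-to-one")
```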
If you have more than 2 dataframes to combine, which only overlap in the ID column, then you can use reduce on a list of dataframes (so put all the dataframes that you want to combine into a list):
library(tidyverse)
df_list <- list(df1, df2)
multi_full <- reduce(df_list, function(x, y, ...)
  full_join(x, y, by = "ID", ...))
Or Reduce with base R:
df_list <- list(df1, df2)
multi_full <- Reduce(function(x, y, ...)
  merge(x, y, by = "ID", all = TRUE, ...), df_list)
Data
df1 <- structure(list(ID = c(99L, 85L, 7L), Gender = c("Male", "Female",
"Male")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(99L, 99L, 99L, 52L, 11L, 7L), Body_Temperature = c(36L,
38L, 37L, 38L, 39L, 35L), Body_Temperature_date_time = c("1/1/2020 12:00 am",
"2/1/2020 10:30 am", "1/1/2020 06:41 am", "1/2/2020 11:00 am",
"4/5/2020 09:09 pm", "9/8/2020 02:30 am")), class = "data.frame", row.names = c(NA,
-6L))

How can I plot a dataframe with 10 columns?

I have a dataset that looks like this:
year China India UnitedStates ....
2020 30 40 50
2021 20 30 60
2022 34 20 40
....
I have 10 columns and more than 50 rows in this dataframe. I have to plot them in one graph to show the movement of the different countries.
So I think a line graph would be good for the purpose, but I don't know how to do the visualisation.
I think I should change the dataframe format first and then start the visualisation. How should I do it?
Pivot (reshape from wide to long), then plot with groups.
dat <- structure(list(year = 2020:2022, China = c(30L, 20L, 34L), India = c(40L, 30L, 20L), UnitedStates = c(50L, 60L, 40L)), class = "data.frame", row.names = c(NA, -3L))
datlong <- reshape2::melt(dat, "year", variable.name = "country", value.name = "value")
datlong
# year country value
# 1 2020 China 30
# 2 2021 China 20
# 3 2022 China 34
# 4 2020 India 40
# 5 2021 India 30
# 6 2022 India 20
# 7 2020 UnitedStates 50
# 8 2021 UnitedStates 60
# 9 2022 UnitedStates 40
### or using tidyr::
tidyr::pivot_longer(dat, -year, names_to = "country", values_to = "value")
Once reshaped, just set group= (and optionally color=) for the lines:
library(ggplot2)
ggplot(datlong, aes(year, value, color = country)) +
  geom_line(aes(group = country))
If you have many more years, the decimal years on the axis will likely smooth out. Alternatively, you can control the axis by converting year to a Date class and forcing the display with scale_x_date.
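A sketch of that Date-axis idea, using the datlong frame from above (anchoring each year at January 1 is an arbitrary choice, made only so the integer years become valid dates):

```r
library(ggplot2)
# turn the integer year into a Date so scale_x_date applies
datlong$date <- as.Date(paste0(datlong$year, "-01-01"))
ggplot(datlong, aes(date, value, color = country)) +
  geom_line(aes(group = country)) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
```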

Writing a function to help merge two different datasets

I am attempting to merge two different datasets: nflfastrpbp and routes.merging.
While both datasets have identifying factors for each game:
nflfastrpbp = game_id, old_game_id
routes.merging = GameID
... they do not match.
Here is a look at the nflfastrpbp data:
# A tibble: 48,514 x 8
game_id old_game_id week home_team away_team game_date pass_defense_1_player_id pass_defense_1_player_name
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
1 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
2 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
3 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
4 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
5 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
6 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
7 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
8 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
9 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
10 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
And here is a look at the routes.merging data:
# A tibble: 80,676 x 6
EventID GameID Season Week OffensiveTeam DefensiveTeam
<int> <int> <int> <int> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs
2 15 2793 2020 1 Texans Chiefs
3 15 2793 2020 1 Texans Chiefs
4 15 2793 2020 1 Texans Chiefs
5 15 2793 2020 1 Texans Chiefs
6 25 2793 2020 1 Texans Chiefs
7 25 2793 2020 1 Texans Chiefs
8 25 2793 2020 1 Texans Chiefs
9 25 2793 2020 1 Texans Chiefs
10 45 2793 2020 1 Chiefs Texans
# ... with 80,666 more rows
What I am trying to do: I am attempting to get the game_id from the nflfastrpbp data onto the routes.merging data, matched up with the correct games, so that I can merge the two together (specifically, to pull the pass_defense_player information from nflfastrpbp into routes.merging).
I've been trying to write a function but cannot figure it out.
If it helps, here is a reprex for each dataset (I will include the 2020_01_ARI_SF game from both to help with matching).
nflfastrpbp reprex:
structure(list(game_id = c("2020_01_ARI_SF", "2020_01_ARI_SF",
"2020_01_ARI_SF", "2020_01_ARI_SF", "2020_01_ARI_SF"), old_game_id = c("2020091311",
"2020091311", "2020091311", "2020091311", "2020091311"), week = c(1L,
1L, 1L, 1L, 1L), home_team = c("SF", "SF", "SF", "SF", "SF"),
away_team = c("ARI", "ARI", "ARI", "ARI", "ARI"), game_date = c("2020-09-13",
"2020-09-13", "2020-09-13", "2020-09-13", "2020-09-13"),
pass_defense_1_player_id = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), pass_defense_1_player_name = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
routes.merging reprex:
structure(list(EventID = c(30L, 30L, 30L, 30L, 45L), GameID = c(2805L,
2805L, 2805L, 2805L, 2805L), Season = c(2020L, 2020L, 2020L,
2020L, 2020L), Week = c(1L, 1L, 1L, 1L, 1L), OffensiveTeam = c("49ers",
"49ers", "49ers", "49ers", "Cardinals"), DefensiveTeam = c("Cardinals",
"Cardinals", "Cardinals", "Cardinals", "49ers")), row.names = c(NA,
-5L), groups = structure(list(EventID = c(30L, 45L), GameID = c(2805L,
2805L), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I hope all that made sense.
EDIT - Expected Outcome: The expected outcome is the routes.merging DF with a new column, id.for.merging, that is the game_id from the nflfastrpbp DF ... again, matched up correctly by game.
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
2 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
3 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
4 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
5 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
6 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
7 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
8 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
9 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
10 45 2793 2020 1 Chiefs Texans 2020_01_HOU_KC
EDIT #2: The GameID from nflfastrpbp and GameID from routes.merging DO NOT match. That is why I am here for help. As seen in the above Expected Outcome, I need the GameID from nflfastrpbp to be on the data for routes.merging so that I can merge all the data from nflfastrpbp onto the routes.merging DF.
I started to write a function that used paste and got as far as 2020_0 but couldn't figure out how to grab the week (which is 01 in the above example ... but will go all the way to 17 with the full data) and then the away_team followed by the home_team ... so 2020_01_HOU_KC
EDIT #3 There is not a column to match.
I am trying to CREATE that column by recreating the game_id column in nflfastpbp within the routes.merging DF so that I can merge the two together on that newly created column.
So, I started to write this function:
testing <- function(x) {
add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}
routes.merging$id.for.merging <- testing()
And, in the id.for.merging column in routes.merging you can see it is working:
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs 2020_0
2 15 2793 2020 1 Texans Chiefs 2020_0
3 15 2793 2020 1 Texans Chiefs 2020_0
4 15 2793 2020 1 Texans Chiefs 2020_0
5 15 2793 2020 1 Texans Chiefs 2020_0
6 25 2793 2020 1 Texans Chiefs 2020_0
7 25 2793 2020 1 Texans Chiefs 2020_0
8 25 2793 2020 1 Texans Chiefs 2020_0
9 25 2793 2020 1 Texans Chiefs 2020_0
10 45 2793 2020 1 Chiefs Texans 2020_0
# ... with 80,666 more rows
What I cannot figure out is how to finish writing that function to take all the information and correctly match the game_id from nflfastrpbp for all the unique games.
So, taking:
testing <- function(x) {
add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}
... and finishing it so that it outputs:
2020_01_ARI_SF
or
2020_07_GB_HOU
into the newly created id.for.merging column.
To be clear:
2020 = year (not included in the data)
01 & 07 = week (included)
GB_HOU = away_team, home_team
You don't need to write a separate function to create a new column. But if you do want to, you can do this:
testing <- function(df) {
library(dplyr)
with(df, # assumes `df` is a data structure like `routes.merging`
paste0(
"2020_",
sprintf("%02d", Week),
"_",
case_when( # away_team == "team_name" ~ "city"
OffensiveTeam == "Texans" ~ "HOU",
OffensiveTeam == "Chiefs" ~ "KC",
OffensiveTeam == "Cardinals" ~ "ARI",
OffensiveTeam == "49ers" ~ "SF",
# etc.
OffensiveTeam == "Packers" ~ "GB"
),
"_",
case_when( # home_team == "team_name" ~ "city"
DefensiveTeam == "Texans" ~ "HOU",
DefensiveTeam == "Chiefs" ~ "KC",
DefensiveTeam == "Cardinals" ~ "ARI",
DefensiveTeam == "49ers" ~ "SF",
# etc.
DefensiveTeam == "Packers" ~ "GB"
)
)
)
}
routes.merging$id.for.merging <- testing(routes.merging)
Otherwise, you can add the column directly like this:
library(dplyr)
routes.merging <- mutate(routes.merging,
id.for.merging = paste0(
"2020_",
sprintf("%02d", Week),
"_",
case_when( # away_team == "team_name" ~ "city"
OffensiveTeam == "Texans" ~ "HOU",
OffensiveTeam == "Chiefs" ~ "KC",
OffensiveTeam == "Cardinals" ~ "ARI",
OffensiveTeam == "49ers" ~ "SF",
# etc.
OffensiveTeam == "Packers" ~ "GB"
),
"_",
case_when( # home_team == "team_name" ~ "city"
DefensiveTeam == "Texans" ~ "HOU",
DefensiveTeam == "Chiefs" ~ "KC",
DefensiveTeam == "Cardinals" ~ "ARI",
DefensiveTeam == "49ers" ~ "SF",
# etc.
DefensiveTeam == "Packers" ~ "GB"
)
)
)
sprintf("%02d", Week) pads any single-digit week (e.g., "1") to two digits (e.g., "01"), while two-digit weeks stay as they are.
case_when() is a function in the dplyr package that lets you vectorize multiple ifelse() statements. You will need to add more lines to case_when() to cover the complete list of NFL teams, of course.
The output from using your reprex data structure looks like this:
# A tibble: 5 x 7
# Groups: EventID, GameID [2]
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
2 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
3 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
4 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
5 45 2805 2020 1 Cardinals 49ers 2020_01_ARI_SF
Finally, merging:
merged_data <- full_join(routes.merging, nflfastrpbp, by = c("id.for.merging" = "game_id"))
Run ?dplyr::full_join or ?merge to learn more about some other merge functions and options.
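As an alternative to the long case_when, a named lookup vector keeps the team-to-abbreviation mapping in one place. The vector below covers only the teams in the reprex and would need to be extended to all 32 teams (a sketch):

```r
# team name -> city abbreviation; extend for the full league
team_abbr <- c(Texans = "HOU", Chiefs = "KC", Cardinals = "ARI",
               `49ers` = "SF", Packers = "GB")

# indexing the named vector by team name vectorizes the lookup
routes.merging$id.for.merging <- paste0(
  "2020_",
  sprintf("%02d", routes.merging$Week), "_",
  team_abbr[routes.merging$OffensiveTeam], "_",
  team_abbr[routes.merging$DefensiveTeam]
)
```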

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from its original long format to the wide format shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
  mutate(coverage = NA_real_) %>%
  pivot_longer(-c(Model, Month), names_to = "Variable") %>%
  pivot_wider(id_cols = c(Model, Variable), names_from = Month) %>%
  unite(Variable, c(Model, Variable), sep = "_")
# A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
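A data.table sketch of the same long-then-wide idea (without the empty coverage rows, which would have to be added separately):

```r
library(data.table)
# long format: one row per Model/Month/variable combination
long <- melt(setDT(data), id.vars = c("Model", "Month"))
# wide format: months become columns, keyed by Model and variable
dcast(long, Model + variable ~ Month, value.var = "value")
```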

Aggregating time-based data of multiple patients to daily averages per patient in R

I have a dataframe that looks like this:
id time value
01 2014-02-26 13:00:00 6
02 2014-02-26 15:00:00 6
01 2014-02-26 18:00:00 6
04 2014-02-26 21:00:00 7
02 2014-02-27 09:00:00 6
03 2014-02-27 12:00:00 6
The dataframe consists of a mood score at different time stamps throughout the day of multiple patients.
I want the dataframe to become like this:
id 2014-02-26 2014-02-27
01 6.25 4.32
02 5.39 8.12
03 9.23 3.18
04 5.76 3.95
with a patient on each row and, in each column, the daily mean for one of the days in the dataframe. If a patient has no mood score on a specific date, I want the value to be NA.
What is the easiest way to do this, using functions like ddply or from other packages?
df <- structure(list(id = c(1L, 2L, 1L, 4L, 2L, 3L), time = structure(c(1393437600,
1393444800, 1393455600, 1393466400, 1393509600, 1393520400), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c(6L, 6L, 6L, 7L, 6L, 6L)), .Names = c("id",
"time", "value"), row.names = c(NA, -6L), class = "data.frame")
Based on your description, this seems to be what you need:
library(tidyverse)
df %>%
  group_by(id, time1 = format(time, '%Y-%m-%d')) %>%
  summarise(new = mean(value)) %>%
  spread(time1, new)
#Source: local data frame [4 x 3]
#Groups: id [4]
# id `2014-02-26` `2014-02-27`
#* <int> <dbl> <dbl>
#1 1 6 NA
#2 2 6 6
#3 3 NA 6
#4 4 7 NA
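With current tidyr, spread is superseded; the same pipeline can be written with pivot_wider (a sketch against the df reprex above):

```r
library(dplyr)
library(tidyr)
df %>%
  group_by(id, day = format(time, "%Y-%m-%d")) %>%
  summarise(daily_mean = mean(value), .groups = "drop") %>%
  # one column per day, filled with the daily mean; missing days become NA
  pivot_wider(names_from = day, values_from = daily_mean)
```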
In base R, you could combine aggregate with reshape like this:
# get means by id-date
temp <- setNames(aggregate(value ~ id + format(time, "%y-%m-%d"), data = df, FUN = mean),
                 c("id", "time", "value"))
# reshape to get dates as columns
reshape(temp, direction="wide", idvar="id", timevar="time")
id value.14-02-26 value.14-02-27
1 1 6 NA
2 2 6 6
3 4 7 NA
5 3 NA 6
I'd recommend using the data.table package; the approach is then very similar to Sotos' tidyverse solution.
library(data.table)
df <- data.table(df)
df[, time1 := format(time, '%Y-%m-%d')]
aggregated <- df[, list(meanvalue = mean(value)), by=c("id", "time1")]
aggregated <- dcast.data.table(aggregated, id~time1, value.var="meanvalue")
aggregated
# id 2014-02-26 2014-02-27
# 1: 1 6 NA
# 2: 2 6 6
# 3: 3 NA 6
# 4: 4 NA 7
(I think my result differs because my system runs in another timezone; I imported the datetime objects as UTC.)
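If timezone drift like that is a concern, one option is to pin the timezone when deriving the day. A base-R sketch using tapply, which returns an id-by-day matrix with NA for missing combinations:

```r
# pin the timezone so day boundaries are stable across machines
day <- format(df$time, "%Y-%m-%d", tz = "UTC")
# mean value per id and day; absent id/day pairs come out as NA
tapply(df$value, list(df$id, day), mean)
```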
