I have a data frame with three columns where each row is unique:
df1
# state val_1 season
# 1 NY 3 winter
# 2 NY 10 spring
# 3 NY 24 summer
# 4 BOS 14 winter
# 5 BOS 26 spring
# 6 BOS 19 summer
# 7 WASH 99 winter
# 8 WASH 66 spring
# 9 WASH 42 summer
I want to create a matrix with the state names for rows and the seasons for columns with val_1 as the values. I have previously used:
library(reshape2)
df <- acast(df1, state ~ season, value.var='val_1')
This used to create the desired matrix, with each state name appearing once. Recently, however, acast and dcast have been defaulting to the length function and returning 1s for the values. Can anyone recommend a solution?
data
state <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)
You can set the fun.aggregate argument explicitly:
library(reshape2)
acast(df1, state~season, value.var = 'val_1', fun.aggregate=sum)
# spring summer winter
# BOS 26 19 14
# NY 10 24 3
# WASH 66 42 99
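reshape2 falls back to length (with the message "Aggregation function missing: defaulting to length") when a state/season combination occurs more than once in the data passed to acast or dcast. The question does not confirm that this is what happened, so the following is only a sketch of how one might check for it. It uses base R only, and tapply is swapped in here as a package-free way to build the same matrix; it is not what the answer above uses.

```r
state  <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1  <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
            'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)

# 0 means every state/season pair is unique; a positive value is the
# index of the first repeated pair, which would trigger aggregation
dup <- anyDuplicated(df1[c("state", "season")])

# Package-free wide matrix: states as rows, seasons as columns
wide <- tapply(df1$val_1, list(df1$state, df1$season), sum)
```

If dup is positive on the real data, the duplicated rows are the first thing to inspect before choosing an aggregation function.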
This also works, without specifying fun.aggregate:
library(reshape2)
state = c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 = c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season = c('winter', 'spring', 'summer', 'winter', 'spring', 'summer', 'winter', 'spring', 'summer')
df1 = data.frame(state,
val_1,
season)
dcast(df1, state~season, value.var = 'val_1')
#> state spring summer winter
#> 1 BOS 26 19 14
#> 2 NY 10 24 3
#> 3 WASH 66 42 99
Created on 2022-04-08 by the reprex package (v2.0.1)
It's easier to explain what I want to do if you look at the code first. Essentially, I think I want to use lapply on a condition, but I wasn't able to do it.
library("tidyverse")
names <- rep(c("City A", "City B"), each = 11)
year <- rep(c(2010:2020), times = 2)
col_1 <- c(1, 17, 34, 788, 3, 4, 78, 98, 650, 45, 20,
23, 45, 56, 877, 54, 12, 109, 167, 12, 19, 908)
col_2 <- c(3, 4, 23, 433, 2, 45, 34, 123, 98, 76, 342,
760, 123, 145, 892, 23, 5, 90, 40, 12, 67, 98)
df <- as.data.frame(cbind(names, year, col_1, col_2))
df <- df %>%
mutate(col_1 = as.numeric(col_1),
col_2 = as.numeric(col_2))
I want every numeric column, in rows for the year 2018 and later, to be rounded to a multiple of three with round_any, i.e. plyr::round_any(x, 3).
What I tried is this:
df_2018 <- df %>%
filter(year >= 2018)
df <- df %>%
filter(!(year >= 2018))
df_2018[, c(3:4)] <- lapply(df_2018[, c(3:4)], plyr::round_any, 3)
df <- rbind(df, df_2018)
In reality, there are about 50 numeric columns and a large number of rows. What I tried works, but I would like to achieve it with less and cleaner code.
I am new to lapply, and I failed when trying to combine it with ifelse because I don't want it to change my year column.
Thank you to everyone who takes the time out of their day to look at this :)
Using dplyr::across and if_else you could do:
library(dplyr)
df |>
mutate(across(-c(names, year), ~ if_else(year >= 2018, plyr::round_any(.x, 3), .x)))
#> names year col_1 col_2
#> 1 City A 2010 1 3
#> 2 City A 2011 17 4
#> 3 City A 2012 34 23
#> 4 City A 2013 788 433
#> 5 City A 2014 3 2
#> 6 City A 2015 4 45
#> 7 City A 2016 78 34
#> 8 City A 2017 98 123
#> 9 City A 2018 651 99
#> 10 City A 2019 45 75
#> 11 City A 2020 21 342
#> 12 City B 2010 23 760
#> 13 City B 2011 45 123
#> 14 City B 2012 56 145
#> 15 City B 2013 877 892
#> 16 City B 2014 54 23
#> 17 City B 2015 12 5
#> 18 City B 2016 109 90
#> 19 City B 2017 167 40
#> 20 City B 2018 12 12
#> 21 City B 2019 18 66
#> 22 City B 2020 909 99
Using data.table:
library(data.table)
# Update the matching columns by reference, only for rows from 2018 onward
cols <- grep("^col_[0-9]+$", names(df), value = TRUE)
setDT(df)[year >= 2018, (cols) := round(.SD / 3) * 3, .SDcols = cols]
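For comparison, the same conditional rounding can be sketched in base R. This is an illustration, not one of the posted answers: it assumes year is stored as a number (the data frame is rebuilt here with data.frame rather than cbind so the columns keep their types), and it uses round(x / 3) * 3, which is what plyr::round_any(x, 3) computes with its default rounding function.

```r
df <- data.frame(
  names = rep(c("City A", "City B"), each = 11),
  year  = rep(2010:2020, times = 2),
  col_1 = c(1, 17, 34, 788, 3, 4, 78, 98, 650, 45, 20,
            23, 45, 56, 877, 54, 12, 109, 167, 12, 19, 908),
  col_2 = c(3, 4, 23, 433, 2, 45, 34, 123, 98, 76, 342,
            760, 123, 145, 892, 23, 5, 90, 40, 12, 67, 98)
)

# Every numeric column except the year itself
num_cols <- setdiff(names(df)[vapply(df, is.numeric, logical(1))], "year")

# Rows to modify
recent <- df$year >= 2018

# Round only those rows, in those columns, to the nearest multiple of 3
df[recent, num_cols] <- lapply(df[recent, num_cols, drop = FALSE],
                               function(x) round(x / 3) * 3)
```

Selecting the columns by type rather than by position means the roughly 50 numeric columns mentioned in the question would not need to be listed by hand.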
I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band the income falls inside. Where more than one observation (income) falls within a band, I would like to keep the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data frames if possible.
The output dataset should have only 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate data frame joined for ease of understanding. You can omit it and combine the two pipelines into one with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7
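When the bands are sorted and do not overlap, the same result can also be sketched in base R with findInterval. This is only an illustration under stated assumptions: the seed is added here for reproducibility (the question sets none), the helper column band is hypothetical, and the filter on id relies on income_range containing a single id, as it does in the question.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

income_range <- data.frame(id = "France",
                           inc_lower = c(10, 21, 31, 41, 51, 61, 71),
                           inc_high  = c(20, 30, 40, 50, 60, 70, 80),
                           perct     = 1:7)
data_occ <- data.frame(id = rep(c("France", "Belgium"), each = 50),
                       income = sample(10:80, 50),
                       occ = rep(c("manager", "clerk", "manual", "skilled", "office"),
                                 each = 20))

# Keep only rows whose id appears in the master table
cand <- data_occ[data_occ$id %in% income_range$id, ]

# Index of the band with the largest lower bound <= income (0 if none);
# findInterval requires inc_lower to be sorted in increasing order
band <- findInterval(cand$income, income_range$inc_lower)
inside <- band > 0
inside[inside] <- cand$income[inside] <= income_range$inc_high[band[inside]]

# Sort by income and keep the first (lowest) row per band
cand <- cbind(cand[inside, c("income", "occ")], band = band[inside])
cand <- cand[order(cand$income), ]
lowest <- cand[!duplicated(cand$band), ]

# Left join onto the master table so every band is kept
income_range$band <- seq_len(nrow(income_range))
result <- merge(income_range, lowest, by = "band", all.x = TRUE)
```

Because of all.x = TRUE, a band with no matching income still appears in result, with NA in income and occ.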
I was asked to solve the following problem:
Create a dataset holding the Turnover (runif 500;1000) as integer values for your 4 sales representatives for the last 4 years, each salesperson selling 4 different products (Mars, Snickers, Bounty, Milkeyway). Additionally, add a column with the integer CostofSales (runif 50;150). Finally, calculate the Earnings in its own column. Combine all values into a data frame.
so I did:
Years <- rep(c(2021:2018),16)
Years
Sales <- rep(c("Chris","Lucas","Cara","Bia"),16)
View(Sales)
Product <- rep(c("Mars","Snickers","Bounty","Milkway"),16)
Product
Turnover <- c(runif(64,500,1000))
Turnover
df <- data.frame(Years,Sales,Product,Turnover)
View(df)
But the resulting data frame is messed up.
Can anyone help me? Thank you.
Perhaps this is what you are trying for:
Years <- rep(c(2021:2018), each=16)
Sales <- rep(rep(c("Chris","Lucas","Cara","Bia"), each=4), 4)
Product <- rep(c("Mars", "Snickers", "Bounty", "Milkway"), 16)
Turnover <- runif(64, 500, 1000)
df <- data.frame(Years,Sales,Product,Turnover)
df[c(4, 8, 12, 16, 20, 24, 28, 32, 36), ]
# Years Sales Product Turnover
# 4 2021 Chris Milkway 964.8695
# 8 2021 Lucas Milkway 799.1933
# 12 2021 Cara Milkway 613.6976
# 16 2021 Bia Milkway 970.3118
# 20 2020 Chris Milkway 598.2047
# 24 2020 Lucas Milkway 951.0657
# 28 2020 Cara Milkway 537.1925
# 32 2020 Bia Milkway 720.0880
# 36 2019 Chris Milkway 759.2236
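The original attempt repeats three length-4 vectors in step, so row i always pairs the same year, salesperson and product and only 4 distinct combinations appear. A minimal sketch of the full task follows, using expand.grid to generate every combination. The seed, the use of round() to obtain integers, and the definition Earnings = Turnover - CostofSales are assumptions, since the task statement does not spell them out.

```r
set.seed(42)  # assumption: fixed seed for reproducibility

# One row per Product x Sales x Years combination (4 * 4 * 4 = 64 rows)
df <- expand.grid(Product = c("Mars", "Snickers", "Bounty", "Milkway"),
                  Sales   = c("Chris", "Lucas", "Cara", "Bia"),
                  Years   = 2021:2018,
                  stringsAsFactors = FALSE)
n <- nrow(df)

# Integer-valued draws from the ranges given in the task
df$Turnover    <- as.integer(round(runif(n, 500, 1000)))
df$CostofSales <- as.integer(round(runif(n, 50, 150)))

# Assumed definition of earnings
df$Earnings <- df$Turnover - df$CostofSales
```

expand.grid varies its first argument fastest, so the rows come out ordered by year, then salesperson, then product, matching the layout of the answer above.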
I have the following data frame with data for each county:
# A tibble: 6 x 4
# Groups: countyfips, day_after_reopening, deciles_income [6]
countyfips day_after_reopening deciles_income winner2016
<int> <drtn> <int> <chr>
1 1001 -109 days 8 Donald Trump
2 1001 -102 days 8 Donald Trump
3 1001 -95 days 8 Donald Trump
4 1001 -88 days 8 Donald Trump
5 1001 -81 days 8 Donald Trump
6 1001 -74 days 8 Donald Trump
I would like to group it by the day_after_reopening column. The problem is that the day_after_reopening value differs slightly between counties: the observations are taken at the same time for every county, but each county reopened on a different day of the week (e.g. of two counties I would like to have in the same group, one might have -109 and the other -108).
How would you group these two counties with very similar numeric values together? Thank you.
You can create artificial groups based on a predefined maximum difference between neighbouring values.
Here is one example:
require(dplyr)
# Difference max that you want
difference_max <- 2
# Create dummy data frame
day_after_reopening <- c(108, 109, 107, 50, 51, 68, 69, 67, 108, 109, 55, 56, 57, 100, 101, 101, 100,56)
df <- data.frame(day_after_reopening = day_after_reopening, index = seq(1:length(day_after_reopening)))
# Order the interesting column
df <- df[order(df$day_after_reopening),]
df$test <- c(diff(df$day_after_reopening, lag = 1), 0)
# Create the breaks where the difference value is greater than a selected value
breaks <- df[df$test > difference_max,]
breaks$test <- "here"
df <- rbind(breaks, df)
df <- df[order(df$day_after_reopening, df$test),]
# Create the split points and grouping
df <- df %>%
mutate(split_point = test < "here",
breaks = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
filter(split_point) %>%
group_by(breaks) %>%
summarise(day_after_reopening_mean = mean(day_after_reopening))
> df
# A tibble: 5 x 2
breaks day_after_reopening_mean
<int> <dbl>
1 1 50.5
2 3 56
3 5 68
4 7 100.
5 9 108.
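The same gap-based grouping can be sketched more compactly in base R: sort the values, start a new group wherever the gap to the previous value exceeds difference_max, and number the groups with cumsum. This is an alternative illustration on the same dummy data, not a rewrite of the answer's code; the names ord and grp are introduced here.

```r
difference_max <- 2
day_after_reopening <- c(108, 109, 107, 50, 51, 68, 69, 67, 108, 109,
                         55, 56, 57, 100, 101, 101, 100, 56)

# Positions that would sort the values
ord <- order(day_after_reopening)

# TRUE marks the first value of each group in sorted order;
# cumsum turns those markers into consecutive group numbers
grp_sorted <- cumsum(c(TRUE, diff(day_after_reopening[ord]) > difference_max))

# Put the group numbers back in the original row order
grp <- integer(length(day_after_reopening))
grp[ord] <- grp_sorted

# Mean per group
group_means <- tapply(day_after_reopening, grp, mean)
```

Because grp is aligned with the original rows, it can be attached to the data frame as a column and passed to group_by directly.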
OK, then it sounds like you'll first want to get the maximum number of days so you know how far out to go. Code for a new data frame could then be something like the following (I've never used cut(), so I don't know how to do it more automatically):
df2 <- df1 %>%
  mutate(day_after_grp =
    case_when(day_after_reopening >= 0 & day_after_reopening <= 6 ~ "0-6",
              day_after_reopening >= 7 & day_after_reopening <= 13 ~ "7-13",
              day_after_reopening >= 14 & day_after_reopening <= 20 ~ "14-20"
              # ... continue the same pattern up to the maximum
              ))
You'd then have a new variable, day_after_grp, to use for grouping.
Again, there may be a more programmatic way to do it with less copy and paste.
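One programmatic option is cut() with breaks generated by seq(). The sketch below is an illustration with made-up values in days (a real difftime column would first need as.numeric()), and the 7-day width aligned on 0 is an assumption taken from the bands in the answer above.

```r
# Hypothetical values; negative days are before reopening
days <- c(-109, -108, -102, -95, 0, 6, 7, 13, 14)

# Breaks every 7 days, aligned on 0 and covering the whole range
lower  <- floor(min(days) / 7) * 7
upper  <- ceiling((max(days) + 1) / 7) * 7
breaks <- seq(lower, upper, by = 7)

# right = FALSE gives half-open bins [a, b), so 0..6 share one bin
# and 7 starts the next
day_after_grp <- cut(days, breaks = breaks, right = FALSE)
```

With these breaks, -109 and -108 land in the same bin, which is the behaviour asked for in the question. Values that straddle a break (for example -106 and -105) would still be split, so the gap-based approach may suit some data better.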
I have the following data frame.
I want to add a column 'Constant Vol' in which every value is the 'Vol' from the year 2006. The result should look like the following data frame.
Using dplyr, we can group_by Seg and get the corresponding Vol where Year == 2006:
library(dplyr)
df %>%
group_by(Seg) %>%
mutate(Constnt_Vol = Vol[Year == 2006])
# Seg Year Vol Constnt_Vol
# <fct> <int> <dbl> <dbl>
#1 Agri 2006 23 23
#2 Agri 2007 29 23
#3 Agri 2008 16 23
#4 Agri 2009 31 23
#5 Auto 2006 12 12
#6 Auto 2007 34 12
#7 Auto 2008 45 12
#8 Auto 2009 32 12
and in data.table that would be
library(data.table)
setDT(df)[, Constnt_Vol := Vol[Year == 2006], Seg]
This assumes there is exactly one row with Year == 2006 in each Seg. If there are multiple, we can use which.max to get the first one: Vol[which.max(Year == 2006)].
data
df <- data.frame(Seg = rep(c("Agri", "Auto"), each =4),
Year =2006:2009, Vol = c(23, 29, 16, 31, 12, 34, 45, 32))
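The same lookup can be sketched in base R without grouping: extract the 2006 rows as a small lookup table, then match each row's Seg against it. This is an illustration rather than one of the posted answers, and the name base_2006 is introduced here. A Seg with no 2006 row gets NA instead of raising an error.

```r
df <- data.frame(Seg  = rep(c("Agri", "Auto"), each = 4),
                 Year = 2006:2009,
                 Vol  = c(23, 29, 16, 31, 12, 34, 45, 32))

# Lookup table: one row per Seg holding its 2006 volume
base_2006 <- df[df$Year == 2006, c("Seg", "Vol")]

# match() returns the first matching position, or NA when a Seg has no 2006 row
df$Constnt_Vol <- base_2006$Vol[match(df$Seg, base_2006$Seg)]
```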
We can also use match, which returns the first position where Year is 2006:
library(dplyr)
df %>%
group_by(Seg) %>%
mutate(Constnt_Vol = Vol[match(2006, Year)])