I have this table:
ID Day Score
23928 Monday 75
394838 Tuesday 83
230902 Wednesday 90
329832 Thursday 40
…
and so on, with each day repeating several times.
I want to transpose the day column to get this table
MONDAY Tuesday Wednesday …… Sunday
78 4343 343 433
Is there a way to do this in R?
We can use data.table::transpose
library(data.table)
data.table::transpose(df1[-1], make.names = 'Day')
Or using base R
as.data.frame.list(with(df1, setNames(Score, Day)))
data
df1 <- structure(list(ID = c(23928L, 394838L, 230902L, 329832L),
Day = c("Monday",
"Tuesday", "Wednesday", "Thursday"), Score = c(75L, 83L, 90L,
40L)), class = "data.frame", row.names = c(NA, -4L))
Assuming your data is stored in a data.frame named df, you could use dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  select(-ID) %>%
  pivot_wider(names_from = Day, values_from = Score)
which returns
# A tibble: 1 x 4
Monday Tuesday Wednesday Thursday
<dbl> <dbl> <dbl> <dbl>
1 75 83 90 40
Use t and set names:
setNames(as.data.frame(t(df$Score)), df$Day)
Output
# Monday Tuesday Wednesday Thursday
# 75 83 90 40
You could use tidyr:
library(tidyr)
data <- data.frame(day = c("Monday", "Tuesday", "Wednesday", "Thursday"),
val = c(12,75,9,38) )
data %>% spread(day,val)
Result:
Monday Thursday Tuesday Wednesday
12 38 75 9
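The answers above use data in which each day appears once, while the question says the days repeat. A minimal base R sketch for that case, assuming you want one row per occurrence of each day (the sample data and the helper column occ are made up for illustration):

```r
# Hypothetical data: every day appears twice
df1 <- data.frame(
  ID    = 1:6,
  Day   = c("Monday", "Tuesday", "Wednesday", "Monday", "Tuesday", "Wednesday"),
  Score = c(75, 83, 90, 40, 55, 61)
)

# Number each occurrence within its day (1 for the first Monday, 2 for the second, ...)
df1$occ <- ave(seq_along(df1$Day), df1$Day, FUN = seq_along)

# Reshape to wide: one row per occurrence, one column per day
wide <- reshape(df1[c("occ", "Day", "Score")],
                idvar = "occ", timevar = "Day", direction = "wide")

# reshape() prefixes the new columns with "Score."; strip that prefix
names(wide) <- sub("^Score\\.", "", names(wide))
wide
```

If a day occurs fewer times than the others, its column is padded with NA in the extra rows.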
I have two datasets, and I need to merge them by the ID value. The problems are:
The ID value can be repeated across the same dataset (no other unique value is available).
The two datasets are not equal in the rows number or the column numbers.
Example:
df1
ID  Gender
99  Male
85  Female
7   Male
df2
ID  Body_Temperature  Body_Temperature_date_time
99  36                1/1/2020 12:00 am
99  38                2/1/2020 10:30 am
99  37                1/1/2020 06:41 am
52  38                1/2/2020 11:00 am
11  39                4/5/2020 09:09 pm
7   35                9/8/2020 02:30 am
How can I turn these two datasets into one single dataset in a way that allows me to apply some machine learning models on it later on?
It depends on your expected result. If you want to return all rows from each dataframe, you can use a full_join from dplyr:
library(dplyr)
full_join(df2, df1, by = "ID")
Or with base R:
merge(x=df2,y=df1,by="ID",all=TRUE)
Output
ID Body_Temperature Body_Temperature_date_time Gender
1 99 36 1/1/2020 12:00 am Male
2 99 38 2/1/2020 10:30 am Male
3 99 37 1/1/2020 06:41 am Male
4 52 38 1/2/2020 11:00 am <NA>
5 11 39 4/5/2020 09:09 pm <NA>
6 7 35 9/8/2020 02:30 am Male
7 85 NA <NA> Female
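Which rows survive depends on the kind of join, and that matters if rows with missing measurements are not useful for modelling. A small base R sketch comparing the options of merge() on a cut-down copy of the example data (the date-time column is left out for brevity):

```r
df1 <- data.frame(ID = c(99L, 85L, 7L),
                  Gender = c("Male", "Female", "Male"))
df2 <- data.frame(ID = c(99L, 99L, 99L, 52L, 11L, 7L),
                  Body_Temperature = c(36L, 38L, 37L, 38L, 39L, 35L))

# Full join: every ID from either table is kept
full  <- merge(df2, df1, by = "ID", all = TRUE)

# Left join: every measurement in df2 is kept, Gender is NA where unknown
left  <- merge(df2, df1, by = "ID", all.x = TRUE)

# Inner join: only IDs present in both tables
inner <- merge(df2, df1, by = "ID")

c(full = nrow(full), left = nrow(left), inner = nrow(inner))
```

Because ID 99 appears three times in df2, its single Gender value is repeated on each of those rows in all three results.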
If you have more than 2 dataframes to combine, which only overlap with the ID column, then you can use reduce on a dataframe list (so put all the dataframes that you want to combine into a list):
library(tidyverse)
df_list <- list(df1, df2)
multi_full <- reduce(df_list, function(x, y, ...)
full_join(x, y, by = "ID", ...))
Or Reduce with base R:
df_list <- list(df1, df2)
multi_full <- Reduce(function(x, y, ...)
merge(x, y, by = "ID", all = TRUE, ...), df_list)
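As an illustration of the list approach with more than two tables, here is a base R sketch with a third data frame. df3 and its Age column are invented for the example and are not part of the question:

```r
df1 <- data.frame(ID = c(99L, 85L, 7L),
                  Gender = c("Male", "Female", "Male"))
df2 <- data.frame(ID = c(99L, 99L, 99L, 52L, 11L, 7L),
                  Body_Temperature = c(36L, 38L, 37L, 38L, 39L, 35L))
# Hypothetical third table that shares only the ID column
df3 <- data.frame(ID = c(99L, 7L), Age = c(40L, 25L))

# Reduce() merges the first two tables, then merges that result with the third
multi_full <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE),
                     list(df1, df2, df3))
multi_full
```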
Data
df1 <- structure(list(ID = c(99L, 85L, 7L), Gender = c("Male", "Female",
"Male")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(99L, 99L, 99L, 52L, 11L, 7L), Body_Temperature = c(36L,
38L, 37L, 38L, 39L, 35L), Body_Temperature_date_time = c("1/1/2020 12:00 am",
"2/1/2020 10:30 am", "1/1/2020 06:41 am", "1/2/2020 11:00 am",
"4/5/2020 09:09 pm", "9/8/2020 02:30 am")), class = "data.frame", row.names = c(NA,
-6L))
Objective:
I have a dataset, df. I wish to first tally the number of occurrences of each date and then multiply that count by a certain number.
Sent Duration Length
1/7/2020 8:11:00 PM 34 216
1/22/2020 7:51:05 AM 432 111
1/7/2020 1:35:08 AM 57 90
1/22/2020 3:43:26 AM 22 212
1/22/2020 4:00:00 AM 55 500
Desired Outcome:
Date Count Aggregation(80)
1/7/2020 2 160
1/22/2020 3 240
I wish to count the number of times a particular 'datetime' occurs and then multiply this outcome by 80. The date, 1/7/2020 occurs twice, and the date of 1/22/2020, occurs three times. I am then multiplying this number count by the number 80.
The dput is:
structure(list(Sent = structure(c(5L, 3L, 4L, 1L, 2L), .Label = c("1/22/2020 3:43:26 AM",
"1/22/2020 4:00:00 AM", "1/22/2020 7:51:05 PM", "1/7/2020 1:35:08 AM",
"1/7/2020 8:11:00 PM"), class = "factor"), Duration = c(34L,
432L, 57L, 22L, 55L), length = c(216L, 111L, 90L, 212L, 500L)), class = "data.frame", row.names = c(NA,
-5L))
This is what I have tried:
df1<- aggregate(df$Sent, by=list(Category= df$dSent),
FUN=length)
However, I need to output how often each date occurs along with the aggregation (the count multiplied by 80).
Any suggestions are welcome.
We can convert Sent to POSIXct format, extract the date, count the number of rows for each date, and multiply the count by 80. Using dplyr, we can do it as:
library(dplyr)
df %>%
group_by(Date = as.Date(lubridate::mdy_hms(Sent))) %>%
summarise(Count = n(), `Aggregation(80)` = Count * 80)
# Date Count `Aggregation(80)`
# <date> <int> <dbl>
#1 2020-01-07 2 160
#2 2020-01-22 3 240
Using table.
as.data.frame(cbind(Count=(r <- table(as.Date(df$Sent, format="%m/%d/%Y %H:%M:%S"))),
Agg=r*80))
# Count Agg
# 2020-01-07 2 160
# 2020-01-22 3 240
or
`rownames<-`(as.data.frame(cbind(Count=(r <- table(as.Date(df$Sent, format="%m/%d/%Y %H:%M:%S"))),
Agg=r*80, Date=names(r)))[c(3, 1:2)], NULL)
# Date Count Agg
# 1 2020-01-07 2 160
# 2 2020-01-22 3 240
Here is the data.table way of things..
code
library( data.table )
#set data as data.table
setDT(mydata)
#set timestamps as posix: 12-hour clock needs %I with %p; a fixed tz keeps as.Date() from shifting dates
mydata[, Sent := as.POSIXct( Sent, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC" ) ]
#summarise
mydata[, .(Count = .N, Aggregation = .N * 80), by = .(Date = as.Date(Sent) )]
output
# Date Count Aggregation
# 1: 2020-01-07 2 160
# 2: 2020-01-22 3 240
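The timestamps use a 12-hour clock with an AM/PM marker, so it is worth checking how they are parsed. The sketch below does the count in base R and assumes an English or C locale, where %p recognises "AM" and "PM"; the variable names are illustrative:

```r
sent <- c("1/7/2020 8:11:00 PM", "1/22/2020 7:51:05 AM", "1/7/2020 1:35:08 AM",
          "1/22/2020 3:43:26 AM", "1/22/2020 4:00:00 AM")

# %I reads the hour on a 12-hour clock and %p applies AM/PM.
# Setting tz explicitly avoids a date shift when converting to Date later.
parsed <- as.POSIXct(sent, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Count the rows per calendar date, then multiply the count by 80
counts <- as.data.frame(table(Date = as.Date(parsed)), responseName = "Count")
counts$Aggregation <- counts$Count * 80
counts
```

Only the date part matters for the counts here, but the parsed hour would matter as soon as the time of day is used.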
Sample Dataset
Date Playerid Revenue Promo DayofWeek
01/01/2017 146123 0 B Sunday
01/01/2017 219378 0 B Sunday
01/01/2017 198614 0 B Sunday
02/01/2017 292640 30 A Monday
02/01/2017 139562 10 A Monday
02/01/2017 124967 20 A Monday
02/01/2017 107954 20 A Monday
03/01/2017 28391 10 B Tuesday
03/01/2017 184388 21 B Tuesday
03/01/2017 264222 20 B Tuesday
03/01/2017 184857 0 B Tuesday
04/01/2017 79788 40 A Wednesday
I want to aggregate the table by DayofWeek, summing the revenue for each day of the week and counting the number of players by Playerid, so that my final output looks like this:
Players Revenue Promo DayofWeek
3 0 B Sunday
4 80 A Monday
4 51 B Tuesday
1 40 A Wednesday
I have been trying to aggregate the dataset attached above but all attempts were unsuccessful. Can you help, please?
Here is my code below.
aggdata <-aggregate(MyData, by=list(DayofWeek,Revenue, Promo, Playerid),
FUN=sum, na.rm=TRUE)
I got the following errors
Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
dplyr approach
library(dplyr)
ans <- df %>%
group_by(DayofWeek) %>%
summarise(Promo=unique(Promo), Revenue=sum(Revenue), Playerid=n())
Output
DayofWeek Promo Revenue Playerid
<chr> <chr> <int> <int>
1 Monday A 80 4
2 Sunday B 0 3
3 Tuesday B 51 4
4 Wednesday A 40 1
Data
df <- structure(list(Date = c("01/01/2017", "01/01/2017", "01/01/2017",
"02/01/2017", "02/01/2017", "02/01/2017", "02/01/2017", "03/01/2017",
"03/01/2017", "03/01/2017", "03/01/2017", "04/01/2017"), Playerid = c(146123L,
219378L, 198614L, 292640L, 139562L, 124967L, 107954L, 28391L,
184388L, 264222L, 184857L, 79788L), Revenue = c(0L, 0L, 0L, 30L,
10L, 20L, 20L, 10L, 21L, 20L, 0L, 40L), Promo = c("B", "B", "B",
"A", "A", "A", "A", "B", "B", "B", "B", "A"), DayofWeek = c("Sunday",
"Sunday", "Sunday", "Monday", "Monday", "Monday", "Monday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Wednesday")), .Names = c("Date",
"Playerid", "Revenue", "Promo", "DayofWeek"), row.names = c(NA,
-12L), class = c("data.table", "data.frame"))
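It's because aggregate() applies FUN to every column of MyData, including character columns such as Date, so sum() fails on those strings. Also, the grouping columns have to be referenced through the data frame. Try summing only the revenue, like this:
aggdata <- aggregate(list(Revenue = MyData$Revenue),
                     by = list(DayofWeek = MyData$DayofWeek, Promo = MyData$Promo),
                     FUN = sum, na.rm = TRUE)
Or, since you want to forget about dates, use the formula interface, which only touches the columns you name:
aggdata <- aggregate(Revenue ~ DayofWeek + Promo, data = MyData, FUN = sum)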
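The requested output also needs a player count next to the revenue sum. One possible base R sketch runs aggregate() twice with different functions and merges the results. It rebuilds the sample data inline and assumes that "number of players" means distinct Playerid values per day:

```r
MyData <- data.frame(
  DayofWeek = rep(c("Sunday", "Monday", "Tuesday", "Wednesday"), times = c(3, 4, 4, 1)),
  Promo     = rep(c("B", "A", "B", "A"), times = c(3, 4, 4, 1)),
  Revenue   = c(0, 0, 0, 30, 10, 20, 20, 10, 21, 20, 0, 40),
  Playerid  = c(146123, 219378, 198614, 292640, 139562, 124967,
                107954, 28391, 184388, 264222, 184857, 79788)
)

# Sum of revenue per day and promo
revenue <- aggregate(Revenue ~ DayofWeek + Promo, data = MyData, FUN = sum)

# Number of distinct players per day and promo
players <- aggregate(Playerid ~ DayofWeek + Promo, data = MyData,
                     FUN = function(x) length(unique(x)))
names(players)[names(players) == "Playerid"] <- "Players"

# Combine both summaries on the grouping columns
result <- merge(players, revenue, by = c("DayofWeek", "Promo"))
result
```

The rows come back sorted alphabetically by DayofWeek, not in calendar order.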
I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each day of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
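Since the question mentions having the counts only as a vector like c(3, 3, 2), a named vector can serve as the lookup directly, without a second data frame. A minimal sketch, where n_day is a hypothetical name and the day labels are assumed to match the Day column exactly:

```r
DF <- data.frame(
  Day         = factor(c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday")),
  Salesperson = c("John", "Sarah", "John", "Sarah", "John", "Sarah"),
  Value       = c(40, 50, 60, 30, 50, 40)
)

# Counts keyed by day name
n_day <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# Index by name; as.character() matters because indexing with a factor
# would use its integer codes instead of its labels
DF$newcol <- DF$Value / unname(n_day[as.character(DF$Day)])
DF
```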
You can use dplyr to join your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=FALSE)
# the count column is named n so that it matches the mutate() call below
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2), stringsAsFactors=FALSE
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To create this auxiliary table from the counts of days that appear in your original data instead of typing it manually, you could use:
aux <- df %>% group_by(Day) %>% summarise(n=n())
Re-running the join with this count-based aux gives:
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n); to remove the additional column, add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)
I am looking to do something in R that seems similar to what I would use the reshape package for, but not quite. I am looking to move some rows of a data frame into columns but not all. For example, my data frame looks something like:
v1, v2, v3
info, time, 12:00
info, day, Monday
info, temperature, 70
data, 1, 2
data, 2, 2
data, 3, 1
data, 4, 1
data, 5, 3
I would like to transform it into something like:
v1, v2, v3, info_time, info_day, info_temperature
data, 1, 2, 12:00, Monday, 70
data, 2, 2, 12:00, Monday, 70
data, 3, 1, 12:00, Monday, 70
data, 4, 1, 12:00, Monday, 70
data, 5, 3, 12:00, Monday, 70
Is there an easy way to do this? Does the reshape package help here?
Thank you in advance for all your help!
Vincent
Try
library(reshape2)
indx <- df$v1=='data'
res <- cbind(df[indx,],dcast(df[!indx,],v1~v2, value.var='v3'))[,-4]
row.names(res) <- NULL
colnames(res)[4:6] <- paste('info', colnames(res)[4:6], sep="_")
res
# v1 v2 v3 info_day info_temperature info_time
#1 data 1 2 Monday 70 12:00
#2 data 2 2 Monday 70 12:00
#3 data 3 1 Monday 70 12:00
#4 data 4 1 Monday 70 12:00
#5 data 5 3 Monday 70 12:00
Or use dplyr/tidyr
library(dplyr)
library(tidyr)
cbind(df[indx,],
unite(df[!indx,], Var, v1, v2) %>%
mutate(id=1) %>%
spread(Var, v3)%>%
select(-id))
Or using base R
cbind(df[indx,],
reshape(transform(df[!indx,], v2= paste(v1, v2, sep="_")),
idvar='v1', timevar='v2', direction='wide')[,-1])
data
df <- structure(list(v1 = c("info", "info", "info", "data", "data",
"data", "data", "data"), v2 = c("time", "day", "temperature",
"1", "2", "3", "4", "5"), v3 = c("12:00", "Monday", "70", "2",
"2", "1", "1", "3")), .Names = c("v1", "v2", "v3"), class = "data.frame",
row.names = c(NA, -8L))
A solution without external packages (using the df structure from akrun's answer):
df1 <- cbind(df[4:8,1:3],apply(df[1:3,3,drop=FALSE],1,function(x) rep(x,nrow(df)-3)))
colnames(df1)[4:6] <- paste("info",df[1:3,2], sep = "_")
df1
> df1
v1 v2 v3 info_time info_day info_temperature
4 data 1 2 12:00 Monday 70
5 data 2 2 12:00 Monday 70
6 data 3 1 12:00 Monday 70
7 data 4 1 12:00 Monday 70
8 data 5 3 12:00 Monday 70
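The base R solution above hard-codes the row positions of the info block. A sketch that selects the rows by the value of v1 instead, so it should also work when the number of info rows differs (the names info and res are illustrative):

```r
df <- data.frame(
  v1 = c("info", "info", "info", "data", "data", "data", "data", "data"),
  v2 = c("time", "day", "temperature", "1", "2", "3", "4", "5"),
  v3 = c("12:00", "Monday", "70", "2", "2", "1", "1", "3"),
  stringsAsFactors = FALSE
)

info <- df[df$v1 == "info", ]
res  <- df[df$v1 == "data", ]
row.names(res) <- NULL

# Add one constant column per info row; the single value is recycled down the column
for (i in seq_len(nrow(info))) {
  res[[paste("info", info$v2[i], sep = "_")]] <- info$v3[i]
}
res
```

All added columns stay character, as in the original data; convert info_temperature with as.numeric() if needed.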