Using R, is there a way to reshape a dataset much like a pivot table in Excel? My data has five variables in total: three identifying variables (Date, Channel and Category) and two metric variables (Views and Spend). I would like to generate time-series data with Date in rows and new variables created automatically for every combination of Channel and Category, for each of the metric variables Views and Spend. This question differs from the other questions because I want the Channel and Category level names to become part of the new variable names.
The start file looks like this
Date=c("01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020","08/01/2020","08/01/2020","15/01/2020","15/01/2020","15/01/2020","15/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","29/01/2020","29/01/2020","05/02/2020","05/02/2020","05/02/2020")
Channel=c("TV","TV","TV","Internet","TV","Internet","TV","Internet","TV","TV","Internet","TV","Internet","TV","TV","Internet","TV","TV","Internet","TV","Internet","TV","Internet")
Category=c("CatA","CatA","CatA","CatA","CatB","CatB","CatB","CatB","CatA","CatB","CatB","CatA","CatB","CatB","CatB","CatB","CatB","CatB","CatB","CatA","CatA","CatA","CatA")
Views=c(190,320,260,300,240,190,200,190,230,30,370,260,350,240,330,190,290,220,230,180,230,310,270)
Spend=c(34,63,46,53,21,23,17,24,20,5,50,42,46,39,44,31,72,54,58,22,29,41,36)
df <- data.frame(Date,Channel,Category,Views,Spend)
df
> df
Date Channel Category Views Spend
1 01/01/2020 TV CatA 190 34
2 01/01/2020 TV CatA 320 63
3 01/01/2020 TV CatA 260 46
4 01/01/2020 Internet CatA 300 53
5 01/01/2020 TV CatB 240 21
6 08/01/2020 Internet CatB 190 23
7 08/01/2020 TV CatB 200 17
8 15/01/2020 Internet CatB 190 24
9 15/01/2020 TV CatA 230 20
10 15/01/2020 TV CatB 30 5
11 15/01/2020 Internet CatB 370 50
12 22/01/2020 TV CatA 260 42
13 22/01/2020 Internet CatB 350 46
14 22/01/2020 TV CatB 240 39
15 22/01/2020 TV CatB 330 44
16 22/01/2020 Internet CatB 190 31
17 22/01/2020 TV CatB 290 72
18 22/01/2020 TV CatB 220 54
19 29/01/2020 Internet CatB 230 58
20 29/01/2020 TV CatA 180 22
21 05/02/2020 Internet CatA 230 29
22 05/02/2020 TV CatA 310 41
23 05/02/2020 Internet CatA 270 36
I would like the reformatted dataframe to look like this
Date=c("01/01/2020","08/01/2020","15/01/2020","22/01/2020","29/01/2020","05/02/2020")
TV.CatA.Views=c(770,0,230,260,180,310)
TV.CatB.Views=c(240,200,30,1080,0,0)
Internet.CatA.Views=c(300,0,0,0,0,500)
Internet.CatB.Views=c(0,190,560,540,230,0)
TV.CatA.Spend=c(143,0,20,42,22,41)
TV.CatB.Spend=c(21,17,5,209,0,0)
Internet.CatA.Spend=c(53,0,0,0,0,65)
Internet.CatB.Spend=c(0,23,74,77,58,0)
df_result <- data.frame(Date,TV.CatA.Views,TV.CatB.Views,Internet.CatA.Views,Internet.CatB.Views,TV.CatA.Spend,TV.CatB.Spend,Internet.CatA.Spend,Internet.CatB.Spend)
df_result
> df_result
Date TV.CatA.Views TV.CatB.Views Internet.CatA.Views Internet.CatB.Views TV.CatA.Spend
1 01/01/2020 770 240 300 0 143
2 08/01/2020 0 200 0 190 0
3 15/01/2020 230 30 0 560 20
4 22/01/2020 260 1080 0 540 42
5 29/01/2020 180 0 0 230 22
6 05/02/2020 310 0 500 0 41
TV.CatB.Spend Internet.CatA.Spend Internet.CatB.Spend
1 21 53 0
2 17 0 23
3 5 0 74
4 209 0 77
5 0 0 58
6 0 65 0
The variable names don't need to be exactly as I've specified above, as long as the levels can be recognised from the variable name. Currently I do this in Excel, but after doing over 50 of them in succession I need a more efficient way.
Thanks for taking time to look at my question, any help is greatly appreciated.
This code produces something similar to what you want, using df you added:
library(tidyverse)
# Sum Views and Spend per Date/Channel/Category, then spread to wide format
mdf <- df %>%
  group_by(Date, Channel, Category) %>%
  summarise_all(.funs = sum) %>%
  ungroup() %>%
  pivot_wider(names_from = c(Channel, Category), values_from = c(Views, Spend))
Output:
Date Views_Internet_CatA Views_TV_CatA Views_TV_CatB Views_Internet_CatB Spend_Internet_CatA
1 01/01/2020 300 770 240 NA 53
2 05/02/2020 500 310 NA NA 65
3 08/01/2020 NA NA 200 190 NA
4 15/01/2020 NA 230 30 560 NA
5 22/01/2020 NA 260 1080 540 NA
6 29/01/2020 NA 180 NA 230 NA
Spend_TV_CatA Spend_TV_CatB Spend_Internet_CatB
1 143 21 NA
2 41 NA NA
3 NA 17 23
4 20 5 74
5 42 209 77
6 22 NA 58
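If zeros are wanted instead of NA, rows in calendar order, and names in the Channel.Category.Metric form from the question, a base-R variant is one option. This is only a minimal sketch, not the method used above: it swaps pivot_wider() for tapply(), uses just the first seven rows of df, and pivot_sum is a hypothetical helper name.

```r
# First seven rows of the question's data, enough to show the idea.
df <- data.frame(
  Date     = c("01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020",
               "01/01/2020", "08/01/2020", "08/01/2020"),
  Channel  = c("TV", "TV", "TV", "Internet", "TV", "Internet", "TV"),
  Category = c("CatA", "CatA", "CatA", "CatA", "CatB", "CatB", "CatB"),
  Views    = c(190, 320, 260, 300, 240, 190, 200),
  Spend    = c(34, 63, 46, 53, 21, 23, 17)
)

# Parse the dates so rows sort chronologically rather than as text.
df$Date <- as.Date(df$Date, format = "%d/%m/%Y")

# Hypothetical helper: sum one metric per Date x (Channel.Category) cell.
pivot_sum <- function(data, metric) {
  key <- interaction(data$Channel, data$Category)  # levels such as "TV.CatA"
  tab <- tapply(data[[metric]], list(data$Date, key), sum)
  tab[is.na(tab)] <- 0                             # empty cells become 0
  colnames(tab) <- paste(colnames(tab), metric, sep = ".")
  tab
}

views <- pivot_sum(df, "Views")
spend <- pivot_sum(df, "Spend")

# Bind the two metric tables next to a proper Date column.
wide <- data.frame(Date = as.Date(rownames(views)), views, spend,
                   check.names = FALSE)
rownames(wide) <- NULL
wide
```

With the full df from the question, the same calls should give one row per date and eight metric columns.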
I am trying to extract the tables from different pages into one data frame. However, I am only able to get it as a list and I am unable to convert to one table. Could you please help me out?
Code we are using so far:
Tables_recent <- lapply(paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;home_or_away=1;home_or_away=2;home_or_away=3;page=",
1:50,
";template=results;type=batting"),
function(url){
url %>% read_html() %>%
html_nodes(xpath= '//*[@id="ciHomeContentlhs"]/div[3]/table[3]') %>%
html_table()
})
Each page's table is nested within a list, so you need to pull out the first element and also drop the entries that only contain "No records available to match this query".
library(dplyr)
library(textreadr)
library(rvest)
LINK = "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;home_or_away=1;home_or_away=2;home_or_away=3;page="
# Read pages 1 to 50; each element of the result is a list holding one table
Tables_recent <- lapply(paste0(LINK, 1:50,";template=results;type=batting"), function(url){
url %>% read_html() %>%
html_nodes(xpath= '//*[@id="ciHomeContentlhs"]/div[3]/table[3]') %>%
html_table()
})
We check the number of columns for each page entry:
> sapply(Tables_recent,function(i)ncol(i[[1]]))
[1] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
[26] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 1 1 1 1 1 1 1 1 1 1
Those with ncol == 1 are empty:
> Tables_recent[[50]]
[[1]]
X1
1 No records available to match this query
We keep only the non-empty entries (16 columns) and rbind them:
wh = which(sapply(Tables_recent,function(i)ncol(i[[1]]))==16)
Table = do.call(rbind,lapply(Tables_recent[wh],"[[",1))
head(Table)
Player Span Mat Inns NO Runs HS Ave BF SR 100
1 RG Sharma (INDIA) 2007-2019 101 93 14 2539 118 32.13 1843 137.76 4
2 V Kohli (INDIA) 2010-2019 72 67 18 2450 90* 50 1811 135.28 0
3 MJ Guptill (NZ) 2009-2019 83 80 7 2436 105 33.36 1810 134.58 2
4 Shoaib Malik (ICC/PAK) 2006-2019 111 104 30 2263 75 30.58 1824 124.06 0
5 BB McCullum (NZ) 2005-2015 71 70 10 2140 123 35.66 1571 136.21 2
6 DA Warner (AUS) 2009-2019 76 76 8 2079 100* 30.57 1476 140.85 1
50 0 4s 6s
1 18 6 225 115 NA
2 22 2 235 58 NA
3 15 2 215 113 NA
4 7 1 186 61 NA
5 13 3 199 91 NA
6 15 5 203 86 NA
What you have is a list. Try:
do.call(rbind,lapply(Tables_recent,function(x){x<-as.data.frame(x); if(length(x)>1)x}))
or you could do
do.call(rbind,Filter(function(x)length(x)>1,lapply(Tables_recent,as.data.frame)))
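The filter-then-rbind idea can be tried without hitting the website. The sketch below assumes a made-up list, pages, shaped like Tables_recent (each element is a list holding one data frame); the player names and numbers are invented for illustration.

```r
# Mock stand-in for Tables_recent: one list per page, one table per list.
pages <- list(
  list(data.frame(Player = c("A", "B"), Runs = c(10, 20))),
  list(data.frame(Player = "C", Runs = 30)),
  list(data.frame(X1 = "No records available to match this query"))
)

# Pull the data frame out of each single-element list.
tables <- lapply(pages, `[[`, 1)

# Keep only real tables; the "no records" placeholder has a single column.
keep <- vapply(tables, function(tbl) ncol(tbl) > 1, logical(1))

# Stack the remaining tables into one data frame.
combined <- do.call(rbind, tables[keep])
combined
```

Filtering on ncol() > 1 rather than on a fixed 16 keeps the sketch independent of the exact table width, at the cost of not catching pages whose layout differs in some other way.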
Maybe I'm phrasing this a bit wrong in the title. The idea is that I have a dataframe that looks like
Station From To PassIn PassOut
Stat1 9 16 213 123
Stat1 16 18 123 14
Stat3 6 7 884 90
Stat2 7 9 213 33
And I want to convert it to:
Station From To PassIn PassOut
Stat1 6 7 884 90
Stat2 6 7 213 33
Stat3 6 7 213 123
Stat1 7 9 884 90
Stat2 7 9 213 33
Stat3 7 9 213 123
Stat1 9 16 884 90
Stat2 9 16 213 33
Stat3 9 16 213 123
The stations cannot be ordered alphabetically, they have different names, and I want to order them based on their location. And the second argument in the sorting function should be the From column.
I know of order(), but I'm unaware of how I can make use of it given the first argument constraint here.
I would do something like:
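# List the levels in the order you want; they must match the values in the data
df$Station <- factor(df$Station, levels = c("Stat1","Stat2","Stat3"))
df$From <- as.numeric(df$From)
# order() sorts factors by level order, not alphabetically
df[order(df$From,df$Station),]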
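If you would rather not convert the column to a factor, match() against a vector of station names in the desired order gives the same kind of sort key. A minimal sketch, assuming a hypothetical location order Stat3, Stat1, Stat2 and sorting by Station first, then From:

```r
df <- data.frame(
  Station = c("Stat1", "Stat1", "Stat3", "Stat2"),
  From    = c(9, 16, 6, 7),
  To      = c(16, 18, 7, 9),
  PassIn  = c(213, 123, 884, 213),
  PassOut = c(123, 14, 90, 33)
)

# Hypothetical order of the stations along the line (not alphabetical).
station_order <- c("Stat3", "Stat1", "Stat2")

# match() turns each station into its position in station_order,
# which order() then uses as the primary key; From breaks ties.
sorted <- df[order(match(df$Station, station_order), df$From), ]
sorted
```

Swapping the two arguments of order() sorts by From first and uses the station position only to break ties.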
I am trying to calculate the residuals from a random forest cross validation. I am working with the response variable "Sales" in this data set. I want to put the residuals into a support vector machine. I am using the Carseats data set in R. Here is my code so far:
set.seed (1)
library(ISLR)
data(Carseats)
head(Carseats)
Sales CompPrice Income Advertising Population Price ShelveLoc
1 9.50 138 73 11 276 120 Bad
2 11.22 111 48 16 260 83 Good
3 10.06 113 35 10 269 80 Medium
4 7.40 117 100 4 466 97 Medium
5 4.15 141 64 3 340 128 Bad
6 10.81 124 113 13 501 72 Bad
Age Education Urban US sales
1 42 17 Yes Yes Yes
2 65 10 Yes Yes Yes
3 59 12 Yes Yes Yes
4 55 14 Yes Yes Yes
5 38 13 Yes No Yes
6 78 16 No Yes Yes
##Random forest
#cross validation to pick best mtry from 3,5,10
library(randomForest)
cv.carseats = rfcv(trainx=Carseats[,-1],trainy=Carseats[,1],cv.fold=5,step=0.9)
cv.carseats
with(cv.carseats,plot(n.var,error.cv,type="o"))
#from the graph it would appear mtry=5 produces the lowest error
##SVM
library(e1071)
#cross validation to pick best gamma
tune.out=tune(svm,Sales~.,data=Carseats,ranges=list(gamma=c(0.01,0.1,1,10)),
tunecontrol = tune.control(cross=5))
I will replace "Sales" in the SVM with the residuals from the random forest cross validation. I am having a difficult time calculating the residuals from the random forest cross validation. Any help is greatly appreciated! Thank you!
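One general way to get cross-validated residuals is to predict every row from a model that was fitted without that row's fold, then subtract. The sketch below is an illustration under stated assumptions, not a randomForest recipe: it uses the built-in mtcars data and lm() as a stand-in model so it runs without extra packages, and the names fold, oof_pred and cv_resid are made up. The same loop should work with randomForest() in place of lm(). As far as I know, rfcv() also returns a predicted component holding its cross-validated predictions, which could be subtracted from Sales directly.

```r
set.seed(1)
data(mtcars)

k <- 5
# Randomly assign each row to one of k folds of near-equal size.
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof_pred <- numeric(nrow(mtcars))
for (i in seq_len(k)) {
  held_out <- fold == i
  # Stand-in model; replace with randomForest(...) for the original use case.
  fit <- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
  oof_pred[held_out] <- predict(fit, newdata = mtcars[held_out, ])
}

# Cross-validated residuals: observed minus out-of-fold prediction.
cv_resid <- mtcars$mpg - oof_pred
summary(cv_resid)
```

The resulting vector has one residual per row, so it can be used as the response in a second model such as the SVM.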
I have the following data and am trying to convert CCG and pract to numbers so that I can use Stan or WinBUGS. When I try, it seems to change the order of the data.
I want to convert CCG and pract to numbers without changing the order of the data. I have tried hard but could not do it.
I am struggling more with this basic issue than with writing the BUGS code. Please help.
I have the following data:
CCG pract Deno Numer Points Excep
1 01C N81049 49 46 4 4
2 01C N81022 28 26 4 23
3 01C N81632 66 64 4 4
4 01C N81069 15 14 4 3
5 01C N81062 98 89 4 9
6 01C N81033 31 28 4 9
I tried converting to integer using as.integer(), and this is what I get:
CCG pract Deno Numer Points Excep
1 20 6621 160 144 41 36
2 20 6594 130 117 41 18
3 20 6698 179 164 41 36
4 20 6640 57 46 41 25
5 20 6633 214 191 41 62
6 20 6605 137 119 41 62
Looking at Deno and Numer, it is clear that the order of the data has changed. Why does CCG not start from 1?
I want
CCG pract Deno Numer Points Excep
1 01C N81049 49 46 4 4
2 01C N81022 28 26 4 23
3 01C N81632 66 64 4 4
4 01C N81069 15 14 4 3
5 01C N81062 98 89 4 9
6 01C N81033 31 28 4 9
to change to something like this:
CCG pract Deno Numer Points Excep
1 1 1 49 46 4 4
2 1 1 28 26 4 23
3 1 1 66 64 4 4
4 1 1 15 14 4 3
5 1 1 98 89 4 9
6 1 1 31 28 4 9
Please help me..
In R, factors are internally represented as integers, linking to a table of the factor levels. AFAIK, these internal integers are assigned based on a lexicographic order of the factor levels, so 57 gets a higher code than 238.
as.integer() will extract this internal integer coding. As you found out, this is not very useful. (I honestly don't understand why R does this when applying as.integer() to factors that have integers as factor levels.)
Solution: first convert to character, then to integer. as.integer(as.character(Deno))
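A small example makes the difference visible. The values 57 and 238 are the ones mentioned above. The pract part is an assumption about what is wanted for ID columns, namely consecutive integers in order of first appearance, and the repeated N81049 is added only to show that equal IDs get equal codes.

```r
# A numeric column that was read in as a factor.
deno <- factor(c("57", "238", "57"))

levels(deno)                              # "238" "57" (sorted as text)
codes  <- as.integer(deno)                # internal codes: 2 1 2
values <- as.integer(as.character(deno))  # the numbers themselves: 57 238 57

# For ID columns such as pract, one option is to number the IDs in the
# order they first appear (assumed here to be the desired coding).
pract <- c("N81049", "N81022", "N81632", "N81049")
pract_id <- as.integer(factor(pract, levels = unique(pract)))  # 1 2 3 1
```

Neither conversion reorders the rows. Only the values change, which is why the result can look as if the data had been shuffled.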