A complex merge in R to flag unmatched observations?

I'm trying to join two datasets together. Call them x and y. I believe that the ID variables in y are a subset of the ID variables in x, though not in a strict sense: I know that x contains more IDs than y, but I don't know the mapping. That is, some (but not all) of the IDs in x and y can be matched 1:1.
My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002")
year <- rep(year, 22)
month <- c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)
#dataset X
x <- cbind(id, X1, month, year)
#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0 , sd = 1)
y <- cbind(id2,Y1)
#merge on the IDs; but we get an error because when id2 == 200 in y we don't
#have a match in x
result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE)
The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):
tail(result)
id X1 month year Y1
106 95 -0.0748386054887876 Nov 2002 NA
107 96 0.196765325477989 Dec 2004 NA
108 97 0.527922135906927 Jan 2005 NA
109 98 0.197927230533413 Feb 2006 NA
110 99 -0.00720474886698309 Mar 2001 NA
111 <NA> <NA> <NA> <NA> -0.9664941
What's more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation existed only once in y, but the merge copied it twice (e.g. Y1 takes the value 1.55 twice).
head(result)
id X1 month year Y1
1 1 -0.67371266313441 Jul 2004 1.553220
2 1 -0.318666983469993 Jul 2004 1.553220
3 10 -0.608192898092431 Apr 2002 1.234325
4 10 -0.72299929212347 Apr 2002 1.234325
5 100 -0.842111221826554 Apr 2002 NA
6 11 -0.16316681842082 Jul 2004 NA
This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn't. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.
I have tried appending by rows with no luck, and it looks like I should give up on merge as well; perhaps it's better to write a loop or function to do something along these lines:
for every observation in x
find which rows of y have an id2 corresponding to this id-month-year
flag = 1 if the number of such rows is exactly 1, 0 otherwise
etc.
Hopefully this all makes sense. I'd be very grateful for any help or guidance.

If you are looking for which things in x$id are in y$id2, then you can use
x$id %in% y$id2
to get a logical vector returning matches. It does not guarantee a 1-to-1 correspondence, however; just a 1-to-many. You can then add this vector to your data frame
x$match.y <- x$id %in% y$id2
to see what rows of x have a corresponding ID in y.
To see which observations are 1-to-1, you could do something like
y$id2[duplicated(y$id2)] #vector of duplicate elements in y$id2
(x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
to filter out elements that appear more than once in y$id2. You can also add this to x:
x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
The same procedure can be done for y to determine what rows of y match in x, and which ones match uniquely.
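Putting those pieces together on your example (a minimal sketch, assuming you first rebuild x and y as data frames rather than cbind matrices):
# build proper data frames from the vectors in the question
x <- data.frame(id, X1, month, year)
y <- data.frame(id2, Y1)
# 0/1 flag: does this row of x have any match in y?
x$flag <- as.integer(x$id %in% y$id2)
# match() gives the position of the first match in y (NA if none),
# so it can also carry Y1 across without duplicating rows of x
x$Y1 <- y$Y1[match(x$id, y$id2)]
# and the rows of y that never matched anything in x (your id2 == 200 case)
y[!(y$id2 %in% x$id), ]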

The reason your merge failed was that you gave it two different structures (x is a character matrix and y a numeric matrix) rather than data frames. Using cbind where data.frame should be used is a common recipe for failure.
> str(x)
chr [1:110, 1:4] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "id" "X1" "month" "year"
> str(y)
num [1:11, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "id2" "Y1"
If you used the data.frame function (since data frames are what merge is designed to work with), it would have succeeded:
> x <- data.frame(id, X1, month, year); y <- data.frame(id2,Y1)
> str( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
'data.frame': 111 obs. of 5 variables:
$ id : num 1 1 2 2 3 3 4 4 5 5 ...
$ X1 : num 1.5063 2.5035 0.7889 -0.4907 -0.0446 ...
$ month: Factor w/ 10 levels "Apr","Aug","Dec",..: 6 6 2 2 10 10 9 9 8 8 ...
$ year : Factor w/ 5 levels "2001","2002",..: 3 3 4 4 5 5 1 1 2 2 ...
$ Y1 : num 1.449 1.449 -0.134 -0.134 -0.828 ...
> tail( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
id X1 month year Y1
106 96 -0.3869157 Dec 2004 NA
107 97 0.6373009 Jan 2005 NA
108 98 -0.7735626 Feb 2006 NA
109 99 -1.3537915 Mar 2001 NA
110 100 0.2626190 Apr 2002 NA
111 200 NA <NA> <NA> -1.509818
If you have duplicates in your 'x' argument, then you should expect duplicates in the result. It's then your responsibility to use !duplicated in whatever manner you deem appropriate (either before or after the merge), but you cannot expect merge to make decisions like that for you.
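For example, a minimal sketch of that last point, keeping the first occurrence of each id (just one possible rule):
# drop repeated ids in x, keeping the first occurrence of each
x_unique <- x[!duplicated(x$id), ]
# each id now appears at most once in x_unique, so the merge is 1:1
# wherever an id exists on both sides; all = TRUE still keeps unmatched rows
result <- merge(x_unique, y, by.x = "id", by.y = "id2", all = TRUE)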

Related

Displaying answers on ranking question in R

I have the following variables, which are the result of one ranking question. On that question, participants were shown the 7 listed motivations and asked to rank them. Here, a value of 1 means the participant put the motivation in position 1, and a value of 7 means they put it in the last position. The ranking is expressed through the numbers on these variables (numbers 1 to 7):
'data.frame': 25 obs. of 8 variables:
$ id : num 8 9 10 11 12 13 14 15 16 17 ...
$ motivation_quantity : num NA 3 1 NA 3 NA NA NA 1 NA ...
$ motivation_quality : num NA 1 6 NA 3 NA NA NA 3 NA ...
$ motivation_timesaving : num NA 6 4 NA 2 NA NA NA 5 NA ...
$ motivation_contribution : num NA 4 2 NA 1 NA NA NA 2 NA ...
$ motivation_alternativelms: num NA 5 3 NA 6 NA NA NA 7 NA ...
$ motivation_inspiration : num NA 2 7 NA 4 NA NA NA 4 NA ...
$ motivation_budget : num NA 7 5 NA 7 NA NA NA 6 NA ...
What I want to do now is calculate and visualize the results of the ranking question (i.e. visualize the results for the motivations). Since I haven't worked with R for a long time, I am not sure how best to do this.
One way I could imagine is to first calculate the top 3 answers (the motivations most frequently ranked in position "1", "2", and "3" across participants).
Would really appreciate it if someone could help with this or even show a better way to analyse and visualize my data.
I originally had a visualization in Microsoft Forms, but it got destroyed by a bug overnight.
These variables are treated by R as numeric (in statistical terms, continuous variables). The goal is to convert them into categorical variables (called factors in R).
Let's get to work:
library(dplyr)
library(tidyr)
library(ggplot2)
# let us first convert the id column to integer; note that integers still count as numeric, so below we exclude id explicitly with vars(-id) rather than relying on mutate_if(is.numeric, ...), which would convert id as well. We shall name your dataframe (df)
df$id <- as.integer(df$id)
# now convert all the other variables into factors (categorical variables) with levels 1 to 7
df <- df %>% mutate_at(vars(-id), factor,
levels = 1:7)
# I guess in your case that would be all, but if you wanted the content of the dataframe to be position_1, position_2 ... position_7, we just add labels like this:
df <- df %>% mutate_at(vars(-id), factor,
levels = 1:7,
labels = paste("position", 1:7, sep = "_"))
# For the visualisation we need the gather function, to convert df into a long dataframe (keeping the id column); we shall name this new dataframe df1
df1 <- df %>% gather(key = Questions, value = Answers, -id)
# the df1 dataframe now includes three columns: the id column - the Questions column - the Answers column
# we can now apply the ggplot function on the new dataframe for the visualisation
# first the colours
colours <- c("firebrick4","firebrick3", "firebrick1", "gray70", "blue", "blue3" ,"darkblue")
# ATTENTION: since there are NAs in your dataframe, you can either recode them as zeros or remove them (for the visualisation) using the subset function within the ggplot call, as follows:
ggplot(subset(df1,!is.na(Answers)))+
aes(x=Questions,fill=Answers)+
geom_bar()+
coord_flip()+
scale_fill_manual(values = colours) +
ylab("position_levels")
# of course you can enter many modifications into the visualisation but in total I think that's what you need.
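For the top 3 answers you asked about, here is a minimal sketch; it assumes the df1 from above was built before adding the position_ labels, so that Answers still holds the raw positions 1 to 7:
library(dplyr)
# count, for each motivation, how often it was ranked in position 1, 2 or 3
top3 <- df1 %>%
filter(as.integer(as.character(Answers)) <= 3) %>%
count(Questions, sort = TRUE)
top3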

Changing class from character to integer but retaining all the data inside

q7 <- dbGetQuery(conn,
"SELECT TailNum AS TailNum, AVG(ontime.DepDelay) AS avg_delay, ontime.Year AS Year, planes.Year AS yearmade
FROM planes JOIN ontime USING(tailnum)
WHERE ontime.Cancelled = 0 AND planes.Year != '' AND planes.Year != 'None' AND ontime.Diverted = 0 AND ontime.DepDelay > 0
GROUP BY TailNum
ORDER BY avg_delay")
Code that I have tried:
q7 <- data.frame(
yearmade = q7$yearmade, stringsAsFactors = FALSE)
Hi! Basically I would like to create a new column where Year minus yearmade is stored, but before I could do that, I found out that the data I drew from the other table into this dataframe shows up as character (yearmade). Is there any way to change its class but retain the original data inside?
First use as.numeric() to change yearmade into a numeric variable. Then you can simply compute the difference between Year and yearmade.
I believe this will work for you.
set.seed(1)
Year <- 2000:2022
yearmade <- sample(c('2000', '1999', '1998'), length(Year), replace = TRUE)
TailNum <- sample(c('N3738B', 'N3737C', 'N37342'), length(Year), replace = TRUE)
avg_delay <- 1:length(Year)
q7 <- data.frame(TailNum, avg_delay, Year, yearmade)
# compute difference and add to data frame
q7$year_diff <- q7$Year - as.numeric(q7$yearmade)
This retains the original data, but introduces a new column year_diff.
> str(q7)
'data.frame': 23 obs. of 5 variables:
$ TailNum : chr "N3738B" "N3738B" "N3737C" "N3738B" ...
$ avg_delay: int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
$ yearmade : chr "2000" "1998" "2000" "1999" ...
$ year_diff: num 0 3 2 4 4 7 8 8 9 11 ...
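One caveat worth knowing: if yearmade had come back as a factor instead of character, as.numeric() alone would return the internal level codes rather than the years, so go through as.character() first:
f <- factor(c("2000", "1998", "1999"))
as.numeric(f) # wrong for factors: gives the level codes 3 1 2
as.numeric(as.character(f)) # right: gives 2000 1998 1999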

Making multiple named data frames with loop

In the process of learning. Didn't ask my first question well, so I'm trying again and doing my best to be more clear.
I'm trying to create a series of data frames for a reproducible question for my larger issue. I would like to make 4 data frames, each named differently by the year. Eventually I will merge these four data frames to explain where I am encountering my issue.
Here is the most recent solution. This runs, but creates a list of four data frames rather than four separate data frames in the global environment.
datafrom <- list()
years <- c(2006,2008,2010,2012)
for (i in 1:length(years)) {
UniqueID <- as.character(1:10) # not all numeric in my real data, so kept as a character vector
Name <- LETTERS[seq( from = 1, to = 10 )]
Entity_Type <- factor(c("This", "That"))
Data1 <- rnorm(10)
Data2 <- rnorm(10)
Data3 <- rnorm(10)
Data4 <- rnorm(10)
Year <- years[i]
datafrom[[i]] <- data.frame(UniqueID, Name, Entity_Type, Data1, Data2, Data3, Data4, Year)
}
I would like 4 separate data frames, each named datafrom2006, datafrom2008, etc.
Many thanks in advance for your patience with my learning.
I'll demonstrate a few (of many) techniques here, and I'll call them (1) brute force, (2) list-based, and (3) single long-form data.frame.
I'll add to the example the use of a function that you want to apply to each data.frame. Though contrived, it helps makes the point:
## some constants used throughout
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
interestingPart <- x[ , grepl('^Data', colnames(x)) ]
sapply(interestingPart, mean)
}
Brute Force
Yes, you can create multiple like-named and same-structure data.frames from a loop, though it is typically frowned upon by many experienced (R?) programmers:
set.seed(42)
for (yr in years) {
tmpdf <- data.frame(UniqueID=as.character(1:n),
Name=LETTERS[1:n],
Entity_Type=factor(c('this', 'that')),
Data1=rnorm(n),
Data2=rnorm(n),
Data3=rnorm(n),
Data4=rnorm(n),
Year=yr)
assign(sprintf('datafrom%s', yr), tmpdf)
}
rm(yr, tmpdf)
ls()
## [1] "datafrom2006" "datafrom2008" "datafrom2010" "datafrom2012" "myfunc"
## [6] "n" "years"
head(datafrom2006, n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
In order to see the results for each data.frame, one would typically (though not always) do something like this:
myfunc(datafrom2006)
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
myfunc(datafrom2008)
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
myfunc(datafrom2010)
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
myfunc(datafrom2012)
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
List-Based
set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
data.frame(UniqueID=as.character(1:n),
Name=LETTERS[1:n],
Entity_Type=factor(c('this', 'that')),
Data1=rnorm(n),
Data2=rnorm(n),
Data3=rnorm(n),
Data4=rnorm(n),
Year=yr)
}, simplify=FALSE)
str(datafrom)
## List of 4
## $ 2006:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
## ..$ Name : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
## ..$ Entity_Type: Factor w/ 2 levels "that","this": 2 1 2 1 2 1 2 1 2 1
## ..$ Data1 : num [1:10] 1.371 -0.565 0.363 0.633 0.404 ...
## ..$ Data2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
## ..$ Data3 : num [1:10] -0.307 -1.781 -0.172 1.215 1.895 ...
## ..$ Data4 : num [1:10] 0.455 0.705 1.035 -0.609 0.505 ...
## ..$ Year : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1
## $ 2008:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
#### ...snip...
head(datafrom[[1]], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
head(datafrom[['2008']], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 0.2059986 0.32192527 -0.3672346 -1.04311894 2008
## 2 2 B that -0.3610573 -0.78383894 0.1852306 -0.09018639 2008
However, with this you can test your function performance with just one:
myfunc(datafrom[[1]])
myfunc(datafrom[['2010']])
and then run the function on all of them very simply:
lapply(datafrom, myfunc)
## $`2006`
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
## $`2008`
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
## $`2010`
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
## $`2012`
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
Long-form Data
If instead you keep all of the data in the same data.frame, using your already-defined column of Year, you can still segment it for exploring individual years:
longdf <- do.call('rbind.data.frame', datafrom)
rownames(longdf) <- NULL
longdf[c(1,11,21,31),]
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.45545012 2006
## 11 1 A this 0.2059986 0.3219253 -0.3672346 -1.04311894 2008
## 21 1 A this 1.5127070 1.3921164 1.2009654 -0.02509255 2010
## 31 1 A this -1.4936251 0.5676206 -0.0861073 -0.04069848 2012
Simple subsets:
subset(longdf, Year == 2006), though subset has its strengths and its caveats.
by(longdf, longdf$Year, myfunc)
If using library(dplyr), try longdf %>% filter(Year == 2010) %>% myfunc()
(Side note: when trying to plot aggregate data, it's often easier when the data is in this form, especially when using ggplot2-like layering and aesthetics.)
Rationale Against "Brute Force"
In answer to your comment question, when making different variables with the same structure, it is easy to infer that you will be doing the same thing to each of them, in turn or immediately consecutively. As a general programming principle, many try to generalize what they do, so that if it can be done once, it can be done an arbitrary number of times without (heavily) adjusting the code. For instance, compare what was necessary to apply myfunc in the two examples above.
Further, if you later want to aggregate the results from your calls to myfunc, it is more laborious in the "brute force" example (as you must capture each return and combine manually), whereas the other two techniques can use simpler summarizing functions (e.g., another lapply, or perhaps Reduce or Filter).
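That said, if you really do want the four separate objects in your global environment, a middle ground (a minimal sketch reusing the list-based datafrom from above) is to name the list elements and export them in one step:
# prefix the names so the objects become datafrom2006, datafrom2008, ...
names(datafrom) <- paste0("datafrom", years)
# copy each list element into the global environment as its own object
list2env(datafrom, envir = .GlobalEnv)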

How can I replace hyphen "cells" in R data frames with zeros?

I have a data frame with some positive numbers, some negative numbers, some words, and some hyphen "cells" in it, as such:
Revenue 73.88 74.76 78.02 78.19 68.74
Other Revenue - Total - - - - -
Total Revenue 73.88 74.76 78.02 78.19 68.74
Cost of Revenue - Total 21.09 21.61 23.01 22.76 19.99
Gross Profit 52.80 -53.15 -55.01 55.43 48.75
I want to replace the hyphens found in the second through last columns with 0s, but only when a hyphen is not the minus sign at the start of a number. For example, I don't want to turn a negative number positive.
I've tried:
df[-1] <- lapply(df[-1], function(x) as.numeric(gsub("-", 0, x)))
but that returns the previous data frame as:
Revenue NA NA NA NA NA
Other Revenue - Total 0 0 0 0 0
Total Revenue NA NA NA NA NA
Cost of Revenue - Total NA NA NA NA NA
Gross Profit NA NA NA NA NA
which is something I definitely don't want. How can I fix this?
Thanks.
This is the output when I call str():
str(income)
'data.frame': 49 obs. of 6 variables:
$ Items : Factor w/ 49 levels "Accounting Change",..: 44 40 47 7 23 45 43 9 29 49 ...
$ Recent1: Factor w/ 14 levels "-","0.00","11,305.00",..: 4 1 4 11 14 6 5 1 1 1 ...
$ Recent2: Factor w/ 16 levels "-","-29.00","0.00",..: 5 1 5 15 16 9 6 1 1 2 ...
$ Recent3: Factor w/ 17 levels "-","0.00","11,449.00",..: 5 1 5 15 17 10 6 1 1 4 ...
$ Recent4: Factor w/ 18 levels "-","-31.00","0.00",..: 6 1 6 15 17 9 4 1 1 18 ...
$ Recent5: Factor w/ 14 levels "-","0.00","1,617.00",..: 4 1 4 10 13 5 3 1 1 1 ...
As #Joe hinted at, the values in a column of a data.frame have to be of the same type, so given that you have -s in the same vectors as what appear to be numerics (52.80, 21.09, etc...), each column is being stored as character or factor (your str() output shows factors). Try gsubbing with "0" instead of 0 and then converting the columns to numeric. Note that as.numeric() turns anything it cannot parse, such as values still containing commas, into NA, which is where your NAs are coming from.
DF <- data.frame(
X1=c(12,45,67,"-",9),
X2=c(34,45,56,"-",12))
str(DF)
'data.frame': 5 obs. of 2 variables:
$ X1: chr "12" "45" "67" "-" ...
$ X2: chr "34" "45" "56" "-" ...
##
DF2 <- DF
DF2$X1 <- gsub("-","0",DF2$X1)
DF2$X1 <- as.numeric(DF2$X1)
str(DF2)
'data.frame': 5 obs. of 2 variables:
$ X1: num 12 45 67 0 9
$ X2: chr "34" "45" "56" "-" ...
EDIT: To remove the commas in your values,
DF <- data.frame(
X0=c("A","B","C","D"),
X1=c("12,300.04","45.5","-","9,046.78"),
X2=c("1,0001.12","33","-","12.6"))
for(j in 2:ncol(DF)){
  # remove thousands separators so as.numeric() can parse the values
  DF[,j] <- gsub(",", "", as.character(DF[,j]))
  for(i in 1:nrow(DF)){
    # a one-character cell can only be "-" (or a single digit, which gsub leaves alone)
    if(nchar(DF[i,j]) == 1){
      DF[i,j] <- gsub("-", "0", DF[i,j])
    }
  }
  DF[,j] <- as.numeric(DF[,j])
}
There are more efficient ways of doing this with *apply functions and regular expressions but this should work. I had to account for the fact that some of your values are negative so assuming the cells with only a - in them are only one character long, this should fix them without affecting the negative values in other cells.
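That more efficient version might look like the sketch below; the "^-$" pattern matches only cells that are exactly one hyphen, so the minus signs on negative numbers survive:
# strip commas, turn lone-hyphen cells into "0", then convert to numeric
DF[-1] <- lapply(DF[-1], function(x) {
x <- gsub(",", "", as.character(x)) # remove thousands separators
x <- sub("^-$", "0", x) # a cell that is just "-" becomes "0"
as.numeric(x)
})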
Assume it is named dat:
dat[2:6] <- lapply( dat[2:6], function(col) as.numeric( gsub("-$|\\,", "", col) ) )
dat[is.na(dat)] <- 0
This only replaces minus signs at the end of a string and removes commas, and gsub coerces factors to character, so you don't need to add as.character. When I imported your data using read.fwf and textConnection I got trailing spaces; you could use gdata::trim to remove those first, but this worked:
lapply(dat[2:6], function(col) as.numeric( gsub("-[ ]*$|\\,", "", col ) ) ) # on RHS
dat<-read.fwf(textConnection("Revenue 73.88 74.76 78.02 78.19 68.74
Other Revenue - Total - - - - -
Total Revenue 73.88 74.76 78.02 78.19 68.74
Cost of Revenue - Total 21.09 21.61 23.01 22.76 19.99
Gross Profit 52.80 -53.15 -55.01 55.43 48.75"), widths=c(24, rep(8,5)))
dat[2:6] <- lapply( dat[2:6], function(col) as.numeric( gsub("-$|\\,", "", col) ) )
dat[is.na(dat)] <- 0
dat
#----------
V1 V2 V3 V4 V5 V6
1 Revenue 73.88 74.76 78.02 78.19 68.74
2 Other Revenue - Total 0.00 0.00 0.00 0.00 0.00
3 Total Revenue 73.88 74.76 78.02 78.19 68.74
4 Cost of Revenue - Total 21.09 21.61 23.01 22.76 19.99
5 Gross Profit 52.80 -53.15 -55.01 55.43 48.75

Transforming an R object so that values are arranged into rows and columns

I have a 168x3 data frame in R. Each row has three columns: weekday, hour, and value. The weekday column corresponds to the day of the week, the hour column to the hour on that day, and the value column to the value I am concerned with.
I am hoping to transform this data such that it exists in a 24x7 matrix, with a row (or column) corresponding to a particular day, and a column (or row) corresponding to a particular hour.
What is the most efficient way to do this in R? I've been able to throw together some messy strings of commands to get something close, but I have a feeling there is a very efficient solution.
Example starting data:
> print(data)
weekday hour value
1 M 1 1.11569683
2 M 2 -0.44550495
3 M 3 -0.82566259
4 M 4 -0.81427790
5 M 5 0.08277568
6 M 6 1.36057839
...
156 SU 12 0.12842608
157 SU 13 0.44697186
158 SU 14 0.86549961
159 SU 15 -0.22333317
160 SU 16 1.75955163
161 SU 17 -0.28904472
162 SU 18 -0.78826607
163 SU 19 -0.78520233
164 SU 20 -0.19301032
165 SU 21 0.65281161
166 SU 22 0.37993619
167 SU 23 -1.58806896
168 SU 24 -0.26725907
I'd hope to get something of the type:
M .... SU
1 1.11569683
2 -0.44550495
3 -0.82566259
4 -0.81427790
5
6
.
.
.
19
20
21 0.65281161
22 0.37993619
23 -1.58806896
24 -0.26725907
You can get some actual sample data this way:
weekday <- rep(c("M","T","W","TH","F","SA","SU"),each=24)
hour <- rep(1:24,7)
value <- rnorm(24*7)
data <- data.frame(weekday=weekday, hour=hour, value=value)
Thanks!
This is pretty trivial with the reshape2 package:
# Sample data - please include some with your next question!
x <- data.frame(day = c(rep("Sunday", 24),
rep("Monday", 24),
rep("Tuesday", 24),
rep("Wednesday", 24),
rep("Thursday", 24),
rep("Friday", 24),
rep("Saturday", 24)),
hour = rep(1:24, 7),
value = rnorm(n = 24 * 7)
)
library(reshape2)
# For rows representing hours
acast(x, hour ~ day)
# For rows representing days
acast(x, day ~ hour)
# If you want to preserve the ordering of the days, just make x$day a factor
# unique(x$day) conveniently gives the right order here, but you'd always want
# check that (and make sure the factor reflects the original value - that's why
# I'm making a new variable instead of overwriting the old one)
x$day.f <- factor(x$day, levels = unique(x$day))
acast(x, hour ~ day.f)
acast(x, day.f ~ hour)
The three-column dataset you have is an example of what's called "molten data" - each row represents a single result (x$value) with one or more identifiers (here, x$day and x$hour). The little formula inside of acast lets you express how you'd like your new dataset to be configured - variable names to the left of the tilde are used to define rows, and variable names to the right to define columns. In this case, there's only one column left - x$value - so it's automatically used to fill in the result matrix.
It took me a while to wrap my brain around all of that, but it's an incredibly powerful way to think about reshaping data.
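For what it's worth, the same cast can be written with the newer tidyr as well (a sketch, assuming the x data frame defined above):
library(tidyr)
# one row per hour, one column per day, cells filled from 'value'
wide <- pivot_wider(x[c("day", "hour", "value")],
names_from = day, values_from = value)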
Something like this (assuming dfrm is the data object; built here from just the M and SU rows shown above, hence the two columns):
M <- matrix( NA, nrow=24, ncol=2,
dimnames = list(Hours = 1:24, Days=unique(dfrm$weekday) ) )
M[ cbind(dfrm$hour, dfrm$weekday) ] <- dfrm$value
> M
Days
Hours M SU
1 1.11569683 NA
2 -0.44550495 NA
3 -0.82566259 NA
4 -0.81427790 NA
5 0.08277568 NA
6 1.36057839 NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
11 NA NA
12 NA 0.1284261
13 NA 0.4469719
14 NA 0.8654996
15 NA -0.2233332
16 NA 1.7595516
17 NA -0.2890447
18 NA -0.7882661
19 NA -0.7852023
20 NA -0.1930103
21 NA 0.6528116
22 NA 0.3799362
23 NA -1.5880690
24 NA -0.2672591
Or you could just "fold the values" if they are "dense":
M <- matrix(dfrm$value, 24, 7)
And then rename your dimensions accordingly. Tested code provided when actual test cases provided.
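Spelling that out (a sketch, assuming the full 168-row data frame sorted by weekday and then hour, since matrix() fills column by column):
# each consecutive run of 24 values becomes one day's column
M <- matrix(dfrm$value, nrow = 24, ncol = 7,
dimnames = list(Hours = 1:24, Days = unique(dfrm$weekday)))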
This is pretty straightforward with xtabs in base R:
output <- as.data.frame.matrix(xtabs(value ~ hour + weekday, data))
head(output)
# SU M T W TH F SA
# 1 -0.56902302 -0.4434357 -1.02356300 -0.38459296 0.7098993 -0.54780300 1.5232637
# 2 0.01023058 -0.2559043 -2.79688932 -1.65322029 -1.5150986 0.05566206 -0.6706817
# 3 0.18461405 1.2783761 -0.02509352 -1.36763623 -0.4978633 0.20300678 1.4211054
# 4 0.54194889 0.5681317 0.69391876 -1.35805959 0.4208977 1.65256590 0.3622756
# 5 -1.68048536 -1.9274994 0.24036908 -0.21959772 0.7654983 1.62773579 0.6760743
# 6 -1.39398673 1.7251476 0.36563174 0.04554249 -0.2991433 -1.47331314 -0.7647513
To get the days in the right order (as above), use factor on your "weekday" variable before doing the xtabs step:
data$weekday <- factor(data$weekday,
levels = c("SU", "M", "T", "W", "TH", "F", "SA"))
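One caveat: xtabs() sums duplicate weekday/hour combinations and fills missing ones with 0 rather than NA, so it is worth checking that the data has exactly one value per cell first:
# TRUE when every weekday/hour combination appears exactly once
all(table(data$weekday, data$hour) == 1)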
