Summarize data already grouped in r - r

With the following dataset in R
ID=Custid
ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0
I would like to convert the dataset to this format of columns
ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212
What is the best way to go about doing this? Can't figure out the best command in R.
Thanks in advance

You can use the reshape2 package and its melt and dcast functions to restructure your data.
data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA,
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L,
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L,
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New",
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L,
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel",
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA,
-3L))
library(reshape2)
## melt data
df_long<-melt(data,id.vars=c("ID","Geo","Channel","Brand","Neworstream"))
## recast in combinations of channel and time frame
dcast(df_long,... ~Channel+variable,sum)

Update/facepalm
The "NA" in your dataset probably aren't NA values but rather, the abbreviation "NA" for North America or something like that.
If you had used na.strings when reading your data in, you should have no problems using reshape as I originally indicated:
mydf <- read.table(header = TRUE, na.strings = "",
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0')
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
(I might, however, recommend changing your abbreviation for legibility and to reduce confusion!)
Original Answer (since there's still something interesting about reshape there)
This should do it:
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1 1 <NA> 1 New 5 0 1
# 3 3 EU 2 Stream NA NA NA
# RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 NA NA NA
# 3 5 1 0
Update (To try to salvage the answer a little bit)
As #Arun points out, the above isn't quite right. The culprit here is interaction(), which is used by reshape() to create a new temporary ID variable when more than one ID variable is specified.
Here's the line from reshape() and what it looks like when applied to our "mydf" object:
data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] <NA> <NA> 3.EU.2.Stream
# Levels: 3.EU.2.Stream
Hmmm. This seems to simplify to two IDs, NA and 3.EU.2.Stream.
What happens if we replace NA with ""?
mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New 1..1.Stream 3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream
Aaahh. That's a little bit better. We now have three unique IDs... and reshape() seems to work.
reshape(mydf, direction = "wide",
idvar=names(mydf)[c(1, 2, 4, 5)],
timevar="Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1 1 1 New 5 0
# 2 1 1 Stream 5 0
# 3 3 EU 2 Stream NA NA
# RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 1 NA NA NA
# 2 1 NA NA NA
# 3 NA 5 1 0

Related

Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In excel, I could use a vlookup inside an iferror. Is there a similar combo of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know min and max of the dataframe we can use them directly. Let's say we want data from V1 = 1 to 10, we can do.
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like :
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))

Sorting Columns by Header Value

I'm very new to R and currently trying to move away from my reliance on excel. I have a file which I'm trying to clean up and the final step of my clean-up procedure is to order the columns by the header value. The headers contain values for e.g.:
"August.2019", "October.2019", "February.2020", "March 2020", "June
2019", July.2019, etc......."
I can't say exactly how many columns like this may exist, as this varies from file to file.
As you can see, they're not in any particular order. Is there a way to organise the columns in ascending order? Any guidance would be hugely appreciated.
One option would be to convert the column names to Date and then order
df[order(as.Date(paste0("01.", names(df)), "%d.%B.%Y"))]
Adding a reproducible example,
df <- structure(list(August.2019 = c(1L, 5L), October.2019 = 2:3,
February.2020 = 3:2, March.2020 = c(4L, 1L), June.2019. = c(5L, 2L),
July.2019 = c(6L,1L)), class = "data.frame", row.names = c(NA, -2L))
which looks like this
df
# August.2019 October.2019 February.2020 March.2020 June.2019. July.2019
#1 1 2 3 4 5 6
#2 5 3 2 1 2 1
df[order(as.Date(paste0("01.", names(df)), "%d.%B.%Y"))]
# June.2019. July.2019 August.2019 October.2019 February.2020 March.2020
#1 5 6 1 2 3 4
#2 2 1 5 3 2 1
We could use the same logic of converting columns to date using functions from other packages like lubridate/anytime or as.yearmon from zoo.
We can use setcolorder from data.table
library(zoo)
library(data.table)
setcolorder(df, order(as.yearmon(names(df), "%b.%Y")))
df
# June.2019. July.2019 August.2019 October.2019 February.2020 March.2020
#1 5 6 1 2 3 4
#2 2 1 5 3 2 1
data
df <- structure(list(August.2019 = c(1L, 5L), October.2019 = 2:3,
February.2020 = 3:2, March.2020 = c(4L, 1L), June.2019. = c(5L, 2L),
July.2019 = c(6L,1L)), class = "data.frame", row.names = c(NA, -2L))

Rearrange data by matching columns

I am having issue with rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
filter(!is.na(artery)) %>%
pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the deprecated gather and spread functions:
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
separate(variable, c('variable', 'num')) %>%
spread(variable, value) %>%
subset(!is.na(artery)) %>%
mutate(diameter = as.numeric(diameter)) %>%
select(-num) %>%
spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
You can use xtabs with reshape from base R. Use the latter to transform data to long format and use the former to get the count table:
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(subset(reshape(df,dir='l',varying=setdiff(names(df),'id'),timevar=NULL),!is.na(artery)),dir='w',timevar='artery');
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.

Restructuring data using apply family of functions

I have inherited a data set that is 23 attributes measured for each of 13 names (between-subjects--each participant only rated one name on all of these attributes). Right now it's structured such that the attributes are the fastest-moving factor, followed by the name. So the the data look like this:
Sub# N1-item1 N1-item2 N1-item3 […] N2-item1 N2-item2 N2-item3
1 3 5 3 NA NA NA
2 NA NA NA 1 5 3
3 3 5 3 NA NA NA
4 NA NA NA 2 2 1
It needs to be restructured it such that it's collapsed over name, and all of the item1 entries are the same column (subjects don't matter for this purpose), as below (bearing in mind that there are 23 items not 3 and 13 names not 2):
Name item1 item2 item3
N1 3 5 3
N2 1 5 3
I can do this with loops and, but I'd rather do it in a manner more natural to R, which I'm guessing would be one of the apply family of functions, but I can't quite wrap my head around it--what is the smart way to do this?
Here's an answer using dplyr and tidyr:
library(dplyr)#loads libraries
library(tidyr)
dat %>% #name of your dataframe
gather(key, val, -Sub) %>% #gathers to long data, with id as Sub
filter(!is.na(val)) %>% #removes rows with NA for the value
separate(key, c("Name", "item")) %>% #split the column key into Name and item
spread(item, val) #spreads the data into wide format, with item as the columns
Sub Name item1 item2 item3
1 1 N1 3 5 3
2 2 N2 1 5 3
3 3 N1 3 5 3
4 4 N2 2 2 1
Spin the column names around to be itemX-NY and then let reshape sort it out:
names(dat)[-1] <- gsub("(^.+?)-(.+?$)", "\\2-\\1", names(dat)[-1])
na.omit(reshape(dat, direction="long", idvar="Sub", varying=-1, sep="-"))
# Sub time item1 item2 item3
#1.N1 1 N1 3 5 3
#3.N1 3 N1 3 5 3
#2.N2 2 N2 1 5 3
#4.N2 4 N2 2 2 1
Where the data was:
dat <- structure(list(Sub = 1:4, `item1-N1` = c(3L, NA, 3L, NA), `item2-N1` = c(5L,
NA, 5L, NA), `item3-N1` = c(3L, NA, 3L, NA), `item1-N2` = c(NA,
1L, NA, 2L), `item2-N2` = c(NA, 5L, NA, 2L), `item3-N2` = c(NA,
3L, NA, 1L)), .Names = c("Sub", "item1-N1", "item2-N1", "item3-N1",
"item1-N2", "item2-N2", "item3-N2"), row.names = c(NA, -4L), class = "data.frame

R - Creating conditional variables with loop

I have a dataset like this (but this is just a subset; the real dataset has hundreds of ID_Desc variables), where each data point has a person's gender, and whether they checked off a number of descriptors (1) or not (NA):
Gender ID1_Desc_1 ID1_Desc_2 ID1_Desc_3 ID2_Desc_1 ID2_Desc_2 ID2_Desc_3 ID3_Desc_1 ID3_Desc_2 ID3_Desc_3
1 NA NA 1 NA NA 1 NA NA NA
2 NA 1 1 NA NA NA 1 1 NA
1 1 1 1 NA 1 NA NA NA NA
I'm trying to write a loop that will (1) check their gender, (2) based on their gender, check whether they checked off the same descriptor in the first list they saw (lists ID1 and ID2 for Gender=1 and lists ID1 and ID3 for Gender=2), and (3) create a new variable (Same#) that indicates whether they checked off the same descriptor in both lists (by writing a 1) or not (by writing a 0).
I've been working with this code, which seems to be checking their gender ok and creating the new variables (Same#), but it's writing 0's for everything, which is not correct:
for (i in 1:3){
assign(paste("Same",i,sep=""),
ifelse(Gender=="1",
ifelse(paste("ID1_Desc_",i,sep="")==paste("ID2_Desc_",i,sep=""),1,0),
ifelse(paste("ID1_Desc_",i,sep="")==paste("ID3_Desc_",i,sep=""),1,0)
)
)
}
Based on the data I provided, Same1 should be 0 0 1 (since Gender=1 and they chose Desc_3 in both the ID1 and ID2 lists), Same2 should be 0 1 0 (since Gender=2 and they chose Desc_2 in both the ID1 and ID3 lists), and Same3 should be 0 1 0 (since Gender=1 and they chose Desc_2 in both the ID1 and ID2 lists) but right now, all 3 come out as 0 0 0.
I know using loops may not be the best way to do this, but I'd really like to know how to do it with loop if it's possible. If not, anything that works would be incredibly appreciated. Thanks.
You may try this
ind1 <- grep("^ID1", colnames(df))
ind2 <- grep("^ID2", colnames(df))
ind3 <- grep("^ID3", colnames(df))
cond1 <- do.call(cbind,Map(`==` , df[ind1], df[ind2]))
cond2 <- do.call(cbind,Map(`==` , df[ind1], df[ind3]))
Finalind <- do.call(cbind, Map(`|`, as.data.frame(t(cond1)),
as.data.frame(t(cond2))))
res <- (!is.na(Finalind))+0
rownames(res) <- paste0("Same", 1:3)
t(res)
# Same1 Same2 Same3
#V1 0 0 1
#V2 0 1 0
#V3 0 1 0
cbind(df, t(res))
data
df <- structure(list(Gender = c(1L, 2L, 1L), ID1_Desc_1 = c(NA, NA,
1L), ID1_Desc_2 = c(NA, 1L, 1L), ID1_Desc_3 = c(1L, 1L, 1L),
ID2_Desc_1 = c(NA, NA, NA), ID2_Desc_2 = c(NA, NA, 1L), ID2_Desc_3 = c(1L,
NA, NA), ID3_Desc_1 = c(NA, 1L, NA), ID3_Desc_2 = c(NA, 1L,
NA), ID3_Desc_3 = c(NA, NA, NA)), .Names = c("Gender", "ID1_Desc_1",
"ID1_Desc_2", "ID1_Desc_3", "ID2_Desc_1", "ID2_Desc_2", "ID2_Desc_3",
"ID3_Desc_1", "ID3_Desc_2", "ID3_Desc_3"), class = "data.frame",
row.names = c(NA, -3L))

Resources