How to gather series of columns with data into rows [duplicate] - r

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I'm just trying to get my head around tidying my data and I have this problem:
I have data as follows:
ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx1Date Tx1Details
1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
I want the data to be in the format
ID Tx TxDate TxDetails
1 14 12/3/14 blabla
1 1e 12/5/14 morebla
1 r 14/2/14 grrr
2 23 14/5/16 albalb
2 342 1/4/5 teeee
2 s 5/6/17 purrr
I have used
library(tidyr)
library(dplyr)
NewData<-mydata %>% gather(key, value, "ID", 2:10)
but I'm not sure how to rename the columns as per the intended output to see if this will work

You can rename the data frame columns to more conventional, separable names and then use the base reshape function. This assumes your initial data frame looks like the one below (I changed the last two column names to Tx3Date and Tx3Details, as otherwise they duplicate columns 3 and 4):
df
# ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx3Date Tx3Details
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
names(df) <- gsub("(\\d)(\\w*)", "\\2\\.\\1", names(df))
df
# ID Tx.1 TxDate.1 TxDetails.1 Tx.2 TxDate.2 TxDetails.2 Tx.3 TxDate.3 TxDetails.3
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
reshape(df, varying = 2:10, idvar = "ID", direction = "long")
# ID time Tx TxDate TxDetails
#1.1 1 1 14 12/3/14 blabla
#2.1 2 1 23 14/5/16 albalb
#1.2 1 2 1e 12/5/14 morebla
#2.2 2 2 342 1/4/5 teeee
#1.3 1 3 r 14/2/14 grrr
#2.3 2 3 s 5/6/17 purrr
Drop the redundant time variable if you don't need it.

The data.table package handles this pretty well.
library(data.table)
setDT(df)
melt(df, measure = list(Tx = grep("^Tx[0-3]$", names(df)),
Date = grep("Date", names(df)),
Details = grep("Details", names(df))),
value.name = c("Tx", "TxDate", "TxDetails"))
Or more concisely
melt(df, measure = patterns("^Tx[0-3]$", "Date", "Details"),
value.name = c("Tx", "TxDate", "TxDetails"))
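Since the question started from tidyr, here is a minimal sketch of the same reshape with pivot_longer, assuming tidyr 1.0 or later is installed. The column renaming step and the name of the helper column (set) are illustrative choices, not part of the original answers, and the data uses the corrected Tx3Date / Tx3Details names.

```r
library(tidyr)

# Sample data from the question, with the last two columns renamed to
# Tx3Date / Tx3Details as in the reshape() answer.
df <- data.frame(
  ID = 1:2,
  Tx1 = c("14", "23"),
  Tx1Date = c("12/3/14", "14/5/16"),
  Tx1Details = c("blabla", "albalb"),
  Tx2 = c("1e", "342"),
  Tx2Date = c("12/5/14", "1/4/5"),
  Tx2Details = c("morebla", "teeee"),
  Tx3 = c("r", "s"),
  Tx3Date = c("14/2/14", "5/6/17"),
  Tx3Details = c("grrr", "purrr"),
  stringsAsFactors = FALSE
)

# Move the set number to the end of each name:
# Tx1 -> Tx_1, Tx1Date -> TxDate_1, Tx1Details -> TxDetails_1
names(df) <- sub("^Tx(\\d+)(.*)$", "Tx\\2_\\1", names(df))

# ".value" tells pivot_longer that the part before "_" names the output
# column, while the part after "_" goes into the (hypothetical) "set" column.
long <- pivot_longer(
  df,
  cols = -ID,
  names_to = c(".value", "set"),
  names_sep = "_"
)

long
```

Drop the set column with long$set <- NULL if it is not needed.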

Related

Move information to new column if the first value of the cell is a four-digit number

I have a column with addresses. The data is not clean and the information includes street and house number or sometimes postcode and city. I would like to move the postcode and city information to another column with R, while street and house number stay in the old place. The postcode is a 4 digit number string. I am grateful for any suggestion for a solution.
An ifelse with grepl should help -
library(dplyr)
df <- df %>%
mutate(Strasse = ifelse(grepl('^\\d{4}', Halter), '', Halter),
Ort = ifelse(Strasse == '', Halter, ''))
# Line Halter Strasse Ort
#1 1 1007 Abc 1007 Abc
#2 2 1012 Long words 1012 Long words
#3 3 Enelbach 54 Enelbach 54
#4 4 Abcd 56 Abcd 56
#5 5 Engasse 21 Engasse 21
grepl('^\\d{4}', Halter) returns TRUE if it finds a 4-digit number at the start of the string else returns FALSE.
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
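As a small illustration of how the pattern behaves, the sketch below compares it with a slightly stricter variant that also requires whitespace after the four digits. The third input value is a made-up example, not from the question's data, and whether the stricter form is wanted depends on how messy the real addresses are.

```r
# Two values from the sample data plus one hypothetical five-digit case
x <- c("1007 Abc", "Enelbach 54", "12345 Five digits")

# Pattern from the answer: any four digits at the start of the string
loose <- grepl("^\\d{4}", x)
loose
# [1]  TRUE FALSE  TRUE

# Stricter variant: exactly four digits followed by whitespace
strict <- grepl("^\\d{4}\\s", x)
strict
# [1]  TRUE FALSE FALSE
```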
In addition to the neat solution of @Ronak Shah, if you want to use base R:
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
df$Strasse <- with(df, ifelse(grepl('^\\d{4}', Halter), '', Halter))
df$Ort <- with(df, ifelse(Strasse == '', Halter, ''))
> head(df)
Line Halter Strasse Ort
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 Enelbach 54
4 4 Abcd 56 Abcd 56
5 5 Engasse 21 Engasse 21
An option is also with separate
library(dplyr)
library(tidyr)
df %>%
separate(Halter, into = c("Strasse", "Ort"), sep = "(?<=[0-9])$|^(?=[0-9]{4} )")
Line Strasse Ort
1 1 1007 Abc
2 2 1012 Long words
3 3 Enelbach 54
4 4 Abcd 56
5 5 Engasse 21
data
df <- structure(list(Line = 1:5, Halter = c("1007 Abc", "1012 Long words",
"Enelbach 54", "Abcd 56", "Engasse 21")), class = "data.frame", row.names = c(NA,
-5L))
Swiss postal codes are made up of 4 digits, so the postcode and city can be extracted directly:
library(dplyr)
library(stringr)
df %>%
mutate(Ort = str_extract(Halter, '\\d{4}\\s.+'))
Line Halter Ort
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 <NA>
4 4 Abcd 56 <NA>
5 5 Engasse 21 <NA>

converting an abbreviation into a full word

I am trying to avoid writing a long nested ifelse statement in excel.
I am working on two datasets, one where I have abbreviations and county names.
Abbre
COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the countycode (abbreviation) into the full spelled-out text and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something and any insight would be greatly appreciated.
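We can use a join (this assumes Abbre is a data frame whose single column COUNTY_NAME holds strings such as "AD Adams")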
library(dplyr)
library(tidyr)
df2 <- df2 %>%
left_join(Abbre %>%
separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match (this assumes the lookup table is a data frame df1 with columns Abbre and COUNTY_NAME)
df2$COUNTY_NAME <- with(
df1,
COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21
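A further base R idiom that may be worth knowing is a named character vector used as a lookup table. The sketch below rebuilds the six abbreviations shown in the question by hand; the object name lookup is an assumption for illustration, and in practice the vector would be built from the full abbreviation table.

```r
# Lookup table as a named vector: names are the codes, values the full names
lookup <- c(AD = "Adams", AS = "Asotin", BE = "Benton",
            CH = "Chelan", CM = "Clallam", CR = "Clark")

df2 <- data.frame(CountyCode = c("WM", "AS", "WM", "WM", "SJ"),
                  Votes = c(97, 14, 163, 144, 21),
                  stringsAsFactors = FALSE)

# Index by name; codes that are not in the lookup come back as NA.
# as.character() guards against CountyCode being a factor.
df2$COUNTY_NAME <- unname(lookup[as.character(df2$CountyCode)])

df2
```

As with the match and join answers, codes missing from the lookup (WM and SJ in this subset) end up as NA rather than raising an error.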

How to Reshape data (with col name parsing)

I need to reshape a data.frame from this:
TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1 10006 11 14 16 24
2 10007 23 27 32 35
To this:
TestID Machine Measure Count
1 10006 1 11 14
2 10006 2 16 24
3 10007 1 23 27
4 10007 2 32 35
Below is code to create each. Looked at reshape in R but couldn't figure out how to split the names
Note: this is a subset of the columns - there are 70-140 machines. How can I make this simpler?
b <-data.frame(10006:10007, matrix(c(11,23,14,27,16,32,24,35),2,4))
colnames(b) <- c("TestID", "Machine1Measure", "Machine1Count", "Machine2Measure", "Machine2Count")
a<-data.frame(matrix(c(10006,10006,10007,10007,1,2,1,2,11,16,23,32,14,24,27,35),4,4))
colnames(a) <- c("TestID", "Machine", "Measure", "Count")
b
a
The following reproduces your expected output (up to column order):
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -TestID) %>%
separate(key, into = c("tmp", "what"), sep = "(?<=\\d)") %>%
separate(tmp, into = c("tmp", "Machine"), sep = "(?=\\d+)") %>%
spread(what, value) %>%
select(-tmp)
# TestID Machine Count Measure
#1 10006 1 14 11
#2 10006 2 24 16
#3 10007 1 27 23
#4 10007 2 35 32
Explanation: We reshape data from wide to long, and use two separate calls to separate the various values and ids before reshaping again from long to wide. (We use a positive look-ahead and positive look-behind to separate the keys into the required fields.)
Sample data
df <- read.table(text =
" TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1 10006 11 14 16 24
2 10007 23 27 32 35", header = T)
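In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider, and the whole chain can be sketched as a single call. This is a minimal sketch assuming tidyr 1.0 or later; the regular expression is my own and is written so that multi-digit machine numbers (the question mentions 70-140 machines) are captured whole.

```r
library(tidyr)

# Same sample data as in the question
b <- data.frame(
  TestID = 10006:10007,
  Machine1Measure = c(11, 23),
  Machine1Count = c(14, 27),
  Machine2Measure = c(16, 32),
  Machine2Count = c(24, 35)
)

# First capture group -> "Machine" column; second group (".value") names
# the output columns, giving Measure and Count.
long <- pivot_longer(
  b,
  cols = -TestID,
  names_to = c("Machine", ".value"),
  names_pattern = "^Machine(\\d+)(.+)$"
)

# The captured machine number is a string; convert it if a number is wanted
long$Machine <- as.integer(long$Machine)

long
```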
data.table can do all this within one melt, which is almost 30x faster than the (perfectly working) tidyverse solution provided by MauritsEvers.
It uses patterns() to select the columns with 'Measure' and 'Count' in their names, and melts each group of columns into the corresponding column named in value.name.
library( data.table )
melt( setDT( b),
id.vars = c("TestID"),
measure.vars = patterns( ".*Measure", ".*Count"),
variable.name = "Machine",
value.name = c("Measure", "Count") )
# TestID Machine Measure Count
# 1: 10006 1 11 14
# 2: 10007 1 23 27
# 3: 10006 2 16 24
# 4: 10007 2 32 35
Benchmarking
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table 182.265 200.3405 245.0403 234.0825 264.6605 3137.967 1000
# reshape 1757.575 1840.7240 2180.4957 1938.3335 2011.3895 100429.392 1000
# tidyverse 6173.203 6430.7830 6925.6034 6569.9670 6763.9810 29722.714 1000
And since nobody else likes reshape() any longer, I'll add an answer:
reshape(
# \\D+ (rather than a greedy .+) keeps multi-digit machine numbers such as 12 intact
setNames(b, sub("^\\D+(\\d+)(.+)$", "\\2.\\1", names(b))),
idvar="TestID", direction="long", varying=-1, timevar="Machine"
)
# TestID Machine Measure Count
#10006.1 10006 1 11 14
#10007.1 10007 1 23 27
#10006.2 10006 2 16 24
#10007.2 10007 2 32 35
It'll never compete with data.table for pure speed, but in brief testing on 2M rows it was far quicker than the tidyverse approach. The test data was built using:
bbig <- b[rep(1:2,each=1e6),]
bbig$TestID <- make.unique(as.character(bbig$TestID))
#data.table - 0.06 secs
#reshape - 2.30 secs
#tidyverse - 56.60 secs

How can I transpose dataset in R [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 5 years ago.
I have a dataset A, shown below. How can I transform dataset A into dataset B? Dataset A contains over 10,000 observations in my file. Is there an easy way to do it?
Dataset A:
Line 1:AB 12 23
Line 2:AB 34 56
Line 3:CD 78 90
Line 4:EF 13 45
Dataset B:
Line 1:AB 12 23 34 56
Line 2:CD 78 90 NA NA
Line 3:EF 13 45 NA NA
Try this by using cSplit
library(splitstackshape)
library(dplyr)
DatA['new']=apply(DatA[,-1], 1, paste, collapse=",")
DatA=DatA%>%group_by(Alphabet)%>%summarise(new=paste(new,collapse=','))
cSplit(DatA, 2, drop = TRUE,sep=',')
Alphabet new_1 new_2 new_3 new_4
1: AB 12 23 34 56
2: CD 78 90 NA NA
3: EF 13 45 NA NA
Data input
DatA <- data.frame(Alphabet = c("AB", "AB", "CD","EF"),
Value1 = c(12,34,78,13),Value2 = c(23,56,90,45),stringsAsFactors = F)
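If you would rather avoid extra packages, the same result can be sketched in base R: collect each group's values row by row, then pad the shorter groups with NA. The object names vals, width, mat and DatB are illustrative, and the value columns come out with default names X1 to X4.

```r
DatA <- data.frame(Alphabet = c("AB", "AB", "CD", "EF"),
                   Value1 = c(12, 34, 78, 13),
                   Value2 = c(23, 56, 90, 45),
                   stringsAsFactors = FALSE)

# For each Alphabet group, flatten its value columns row by row
vals <- lapply(split(DatA[-1], DatA$Alphabet),
               function(g) as.vector(t(as.matrix(g))))

# Width of the longest group decides how many output columns are needed
width <- max(lengths(vals))

# Pad every group to that width with NA and stack the groups as rows
mat <- t(vapply(vals,
                function(v) c(v, rep(NA_real_, width - length(v))),
                numeric(width)))

DatB <- data.frame(Alphabet = rownames(mat), mat, row.names = NULL)
DatB
```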

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use the base reshape so I probably should have shown this using reshape2 (Wickham) but forgot there was a reshape2 package. Slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)
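With a recent tidyr, pivot_wider is another option. The following is a minimal sketch that assumes a tidyr version supporting the names_glue argument; the employment and establishments values are typed in from the printed data frame in the question rather than regenerated with rnorm.

```r
library(tidyr)

# Values copied from the printed data frame in the question
data <- data.frame(
  state = "A",
  county = rep(letters[1:4], each = 2),
  industry = rep(c("construction", "manufacturing"), 4),
  employment = c(146, 110, 121, 90, 197, 73, 98, 102),
  establishments = c(19, 20, 10, 27, 18, 29, 30, 19),
  stringsAsFactors = FALSE
)

# One row per state/county; names_glue builds the requested column names
# such as construction.employment and manufacturing.establishments.
wide <- pivot_wider(
  data,
  id_cols = c(state, county),
  names_from = industry,
  values_from = c(employment, establishments),
  names_glue = "{industry}.{.value}"
)

wide
```

This scales to any number of industries without changing the call, since the new columns are generated from the levels found in industry.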
