Move information to new column if the first value of the cell is a four-digit number - r

I have a column with addresses. The data is not clean and the information includes street and house number or sometimes postcode and city. I would like to move the postcode and city information to another column with R, while street and house number stay in the old place. The postcode is a 4 digit number string. I am grateful for any suggestion for a solution.

An ifelse with grepl should help -
library(dplyr)
df <- df %>%
mutate(Strasse = ifelse(grepl('^\\d{4}', Halter), '', Halter),
Ort = ifelse(Strasse == '', Halter, ''))
# Line Halter Strasse Ort
#1 1 1007 Abc 1007 Abc
#2 2 1012 Long words 1012 Long words
#3 3 Enelbach 54 Enelbach 54
#4 4 Abcd 56 Abcd 56
#5 5 Engasse 21 Engasse 21
grepl('^\\d{4}', Halter) returns TRUE if the string starts with four digits and FALSE otherwise.
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
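A minimal sketch of a slightly stricter variant, in case some cells start with more than four digits. The sixth row ('10075 Xyz') is a hypothetical value added only to illustrate the difference; it is not in the original data. The pattern requires exactly four leading digits followed by whitespace or the end of the string.

```r
# Example data from the answer, plus one hypothetical row ('10075 Xyz')
df <- data.frame(Line = 1:6,
                 Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
                            'Abcd 56', 'Engasse 21', '10075 Xyz'),
                 stringsAsFactors = FALSE)

# TRUE only when the cell starts with exactly four digits
# followed by whitespace or the end of the string
is_ort <- grepl('^\\d{4}(\\s|$)', df$Halter)

# Keep street information in Strasse, move postcode + city to Ort
df$Strasse <- ifelse(is_ort, '', df$Halter)
df$Ort     <- ifelse(is_ort, df$Halter, '')
```

Computing the logical vector once and reusing it for both columns avoids having to compare Strasse against an empty string afterwards.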

In addition to the neat solution of @Ronak Shah, if you want to use base R
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
df$Strasse <- with(df, ifelse(grepl('^\\d{4}', Halter), '', Halter))
df$Ort <- with(df, ifelse(Strasse == '', Halter, ''))
> head(df)
Line Halter Strasse Ort
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 Enelbach 54
4 4 Abcd 56 Abcd 56
5 5 Engasse 21 Engasse 21

An option is also with separate
library(dplyr)
library(tidyr)
df %>%
separate(Halter, into = c("Strasse", "Ort"), sep = "(?<=[0-9])$|^(?=[0-9]{4} )")
Line Strasse Ort
1 1 1007 Abc
2 2 1012 Long words
3 3 Enelbach 54
4 4 Abcd 56
5 5 Engasse 21
data
df <- structure(list(Line = 1:5, Halter = c("1007 Abc", "1012 Long words",
"Enelbach 54", "Abcd 56", "Engasse 21")), class = "data.frame", row.names = c(NA,
-5L))

Swiss postal codes are made up of 4 digits, so str_extract can pull the postcode and city into their own column:
library(dplyr)
library(stringr)
df %>%
mutate(Ort = str_extract(Halter, '^\\d{4}\\s.+'))
Line Halter Ort
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 <NA>
4 4 Abcd 56 <NA>
5 5 Engasse 21 <NA>


converting an abbreviation into a full word

I am trying to avoid writing a long nested ifelse statement in excel.
I am working on two datasets, one where I have abbreviations and county names.
Abbre COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the countycode (abbreviation) into the full spelled-out text and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something and any insight would be greatly appreciated.
We can use a join
library(dplyr)
library(tidyr)
df2 <- df2 %>%
left_join(Abbre %>%
separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match
df2$COUNTY_NAME <- with(
df1,
COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21
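A minimal sketch of the join idea, assuming the lookup table is a data frame named df1 whose abbreviation and county name are stored in two separate columns (Abbre and COUNTY_NAME), as the match and data.table answers also assume. The data frames are rebuilt here from the values shown in the question.

```r
library(dplyr)

# Lookup table: abbreviation -> full county name (values from the question)
df1 <- data.frame(Abbre = c("AD", "AS", "BE", "CH", "CM", "CR"),
                  COUNTY_NAME = c("Adams", "Asotin", "Benton",
                                  "Chelan", "Clallam", "Clark"),
                  stringsAsFactors = FALSE)

# Votes table keyed by county abbreviation
df2 <- data.frame(CountyCode = c("WM", "AS", "WM", "WM", "SJ"),
                  Votes = c(97, 14, 163, 144, 21),
                  stringsAsFactors = FALSE)

# Keep every row of df2; codes missing from df1 get NA as the county name
res <- left_join(df2, df1, by = c("CountyCode" = "Abbre"))
```

Only "AS" appears in the six lookup rows shown, so the other codes come back as NA, which agrees with the match and data.table output above.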

How to create a Markdown table with different column lengths based on a dataframe in long format in R?

I'm working on a R Markdown file that I would like to submit as a manuscript to an academic journal. I would like to create a table that shows which three words (item2) co-occur most frequently with some keywords (item1). Note that some key words have more than three co-occurring words. The data that I am currently working with:
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- data.frame(item1,item2,n)
Which gives this dataframe:
item1 item2 n
1 water tree 200
2 water dog 83
3 water cat 34
4 water fish 34
5 water eagle 34
6 sun bird 300
7 sun table 250
8 sun bed 77
9 sun flower 77
10 moon house 122
11 moon desk 46
12 moon tiger 46
Ultimately, I would like to pass the data to the function papaja::apa_table, which requires a data.frame (or a matrix / list). I therefore need to reshape the data.
My question:
How can I reshape the data (preferably with dplyr) to get the following structure?
water_item2 water_n sun_item2 sun_n moon_item2 moon_n
1 tree 200 bird 300 house 122
2 dog 83 table 250 desk 46
3 cat 34 bed 77 tiger 46
4 fish 34 flower 77 <NA> <NA>
5 eagle 34 <NA> <NA> <NA> <NA>
We can borrow an approach from an old answer of mine to a different question, and modify a classic gather(), unite(), spread() strategy by creating unique identifiers by group to avoid duplicate identifiers, then dropping that variable:
library(dplyr)
library(tidyr)
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
# Owing to Richard Telford's excellent comment,
# I use tibble() (called data_frame() in older dplyr versions; or,
# equivalently for our purposes, data.frame(..., stringsAsFactors = FALSE))
# to avoid turning the strings into factors
df <- tibble(item1,item2,n)
df %>%
group_by(item1) %>%
mutate(id = 1:n()) %>%
ungroup() %>%
gather(temp, val, item2, n) %>%
unite(temp2, item1, temp, sep = '_') %>%
spread(temp2, val) %>%
select(-id)
# A tibble: 5 x 6
moon_item2 moon_n sun_item2 sun_n water_item2 water_n
<chr> <chr> <chr> <chr> <chr> <chr>
1 house 122 bird 300 tree 200
2 desk 46 table 250 dog 83
3 tiger 46 bed 77 cat 34
4 NA NA flower 77 fish 34
5 NA NA NA NA eagle 34
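The same reshape can probably be written with pivot_wider(), which supersedes the gather()/spread() pair in current tidyr. This is a sketch, assuming tidyr 1.2.0 or later for the names_vary argument; the per-group id column plays the same role as in the answer above.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  item1 = c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon"),
  item2 = c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger"),
  n     = c("200","83","34","34","34","300","250","77","77","122","46","46")
)

wide <- df %>%
  group_by(item1) %>%
  mutate(id = row_number()) %>%        # unique row identifier within each keyword
  ungroup() %>%
  pivot_wider(
    id_cols     = id,
    names_from  = item1,
    values_from = c(item2, n),
    names_glue  = "{item1}_{.value}",  # e.g. water_item2, water_n
    names_vary  = "slowest"            # keep each keyword's two columns adjacent
  ) %>%
  select(-id)
```

Keywords with fewer than five co-occurring words are padded with NA, and the result can be converted with as.data.frame() before it is passed on.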

How can I extract elements and value from series of strings and arrange it rightly?

I have a data frame like this,
DF1= c(
"Name : John Miller, Math : 100, History : 80, Physics: 90",
"Name : Mary Smith, French : 99, History : 90, Physics: 89",
"Name : Eddy Abbot, Math : 90, French : 85, Chemistry : 90"
)
I would like to turn it into a table like this (preferably a data.table):
Name Math French History Physics Chemistry
1: John Miller 100 NA 80 90 NA
2: Mary Smith NA 99 90 89 NA
3: Eddy Abbot 90 85 NA NA 90
I am wondering whether my idea is heading in the right direction:
1. Split the strings into pieces on ",".
2. Get the keywords ("French", "Math", etc.) by splitting on " : ".
3. Fill the right row and column with the respective value.
I would appreciate advice on step 3. Many thanks.
Replace each comma and end-of-line with a newline and each space-colon with just a colon. Read that using readLines to break up the strings into separate lines and use trimws to remove any junk whitespace. At this point the text is in Debian Control Format (DCF), so we can use read.dcf to read it, creating character matrix m. Now convert m to a data.table (this needs the data.table package loaded) and convert the types.
library(data.table)
dcf <- trimws(readLines(textConnection(gsub(" :", ":", gsub(",|$", "\n", DF1)))))
m <- read.dcf(textConnection(dcf))
DT <- as.data.table(m)[, lapply(.SD, type.convert, as.is = TRUE)]
giving:
> DT
Name Math History Physics French Chemistry
1: John Miller 100 80 90 NA NA
2: Mary Smith NA 90 89 99 NA
3: Eddy Abbot 90 NA NA 85 90
Note
We used the object name DF1 for consistency with the question but it is a character vector, not a data frame, so you might want to choose a different name for it.
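To make the DCF idea above easier to follow, here is a base R sketch that builds the intermediate lines explicitly rather than with nested gsub calls. The names records, dcf_lines, and out are illustrative, and the result is a plain data frame rather than a data.table.

```r
DF1 <- c(
  "Name : John Miller, Math : 100, History : 80, Physics: 90",
  "Name : Mary Smith, French : 99, History : 90, Physics: 89",
  "Name : Eddy Abbot, Math : 90, French : 85, Chemistry : 90"
)

# One character vector of "Field: value" lines per input string
records <- lapply(strsplit(DF1, ","), function(fields) {
  sub("\\s*:\\s*", ": ", trimws(fields))
})

# DCF separates records with a blank line, so append "" after each record
dcf_lines <- unlist(lapply(records, function(r) c(r, "")))

# read.dcf returns a character matrix: one row per record, one column per field
m <- read.dcf(textConnection(dcf_lines))

# Convert to a data frame and let type.convert turn the scores into integers
out <- as.data.frame(m, stringsAsFactors = FALSE)
out[] <- lapply(out, type.convert, as.is = TRUE)
```

Fields that are missing from a record come back as NA, which is what fills the gaps in the wide table.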
We convert it to a tibble, create a row names column ('rn'), expand the rows by splitting at "," (separate_rows), separate 'col' at ":" into 'col1' and 'col2', spread it to 'wide' format, and change the type
library(tidyverse)
tibble(col = DF1) %>%
rownames_to_column('rn') %>%
separate_rows(col, sep = "\\s*,\\s*") %>%
separate(col, into = c('col1', 'col2'), sep="\\s*:\\s*") %>%
spread(col1, col2) %>%
select(-rn) %>%
mutate_all(type.convert, as.is = TRUE) %>%
select(Name, Math, French, History, Physics, Chemistry)
# A tibble: 3 x 6
# Name Math French History Physics Chemistry
# <chr> <int> <int> <int> <int> <int>
#1 John Miller 100 NA 80 90 NA
#2 Mary Smith NA 99 90 89 NA
#3 Eddy Abbot 90 85 NA NA 90
It is also possible to convert to JSON format and then use fromJSON
library(jsonlite)
out <- fromJSON(paste0("[", paste("{", gsub('"(\\d+)"', "\\1",
gsub('(\\w+)\\s*:\\s*([^,]+)', '"\\1":"\\2"', DF1)), "}", sep="", collapse=",\n"), "]"))
out
# Name Math History Physics French Chemistry
#1 John Miller 100 80 90 NA NA
#2 Mary Smith NA 90 89 99 NA
#3 Eddy Abbot 90 NA NA 85 90

Assign one name for one id from similar names

I have 1 million observations and 4 variables (ID, NAME, COMPANY, TIPS).
The ID values are correct, but the NAME column mixes full names with partial ones (some rows have only a first name). The last row of each ID (see IDs 2, 3, and 4) always holds the full name, so I want to copy that full name to every row of the ID, so that each ID shows one correct name.
Sample data table below (Dt format):
ID Name Company Tips
1 Dave AB 50
2 PAT E DAV ABC 15
2 PAT ERIN DAV(full name) AB 26
3 JIL WIRTH DFG 26
3 JIL K WIRTH EF 45
3 JILL KATH WIRTH(full name) JUI 85
4 MARIANA PO KIL 50
4 MARIANA A PO(full name) LPI 55
5 BRET LLC 52
Expected Output
ID Name Company Tips
1 Dave AB 50
2 PAT ERIN DAV ABC 15
2 PAT ERIN DAV AB 26
3 JILL KATH WIRTH DFG 26
3 JILL KATH WIRTH EF 45
3 JILL KATH WIRTH JUI 85
4 MARIANA A PO KIL 50
4 MARIANA A PO LPI 55
5 BRET LLC 52
One way would be to take the longest name for each ID. Here is a way using dplyr...
library(dplyr)
df <- df %>% group_by(ID) %>% mutate(Name2=Name[which.max(nchar(Name))])
df
ID Name Company Tips Name2
<int> <chr> <chr> <int> <chr>
1 1 Dave AB 50 Dave
2 2 PAT E DAV ABC 15 PAT ERIN DAV
3 2 PAT ERIN DAV AB 26 PAT ERIN DAV
4 3 JIL WIRTH DFG 26 JILL KATH WIRTH
5 3 JIL K WIRTH EF 45 JILL KATH WIRTH
6 3 JILL KATH WIRTH JUI 85 JILL KATH WIRTH
7 4 MARIANA PO KIL 50 MARIANA A PO
8 4 MARIANA A PO LPI 55 MARIANA A PO
9 5 BRET LLC 52 BRET
To overwrite Name with the new values, just change Name2 to Name.
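The longest-name idea can also be sketched in base R with ave(). This uses a subset of the question's rows without the "(full name)" marker, and it relies on the assumption that the longest name within an ID is the complete one.

```r
df <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3),
  Name = c("Dave", "PAT E DAV", "PAT ERIN DAV",
           "JIL WIRTH", "JIL K WIRTH", "JILL KATH WIRTH"),
  stringsAsFactors = FALSE
)

# For each ID, repeat the name with the most characters across the group
df$Name2 <- ave(df$Name, df$ID, FUN = function(x) {
  rep(x[which.max(nchar(x))], length(x))
})
```

which.max returns the first maximum, so ties between equally long names are resolved in favour of the earlier row.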
A base R solution would be to sort based on the full name and replace. The final step is the gsub that removes the (full name)
gsub('\\(.*', '', with(df[order(df$ID,
gsub("[\\(\\)]", "", regmatches(df$Name, gregexpr("\\(.*?\\)",
df$Name)))),], ave(Name, ID, FUN = function(i) `<-`(i, tail(i, 1)))))
#[1] "Dave" "PAT ERIN DAV" "PAT ERIN DAV" "JILL KATH WIRTH" "JILL KATH WIRTH" "JILL KATH WIRTH" "MARIANA A PO" "MARIANA A PO"
#[9] "BRET"
A solution uses functions from dplyr and tidyr. It fills the Name using the last one of each ID. dt2 is the final output.
If (full name) is truly in your data frame and you want to remove it, then we can use gsub and regular expression to do that. dt3 is the final output.
# Load packages
library(dplyr)
library(tidyr)
# Create example data frames
dt <- read.table(text = "ID Name Company Tips
1 Dave AB 50
2 'PAT E DAV' ABC 15
2 'PAT ERIN DAV(full name)' AB 26
3 'JIL WIRTH' DFG 26
3 'JIL K WIRTH' EF 45
3 'JILL KATH WIRTH(full name)' JUI 85
4 'MARIANA PO' KIL 50
4 'MARIANA A PO(full name)' LPI 55
5 'BRET' LLC 52",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- dt %>%
group_by(ID) %>%
# Replace names that are not on the last row of each ID to be NA
mutate(Name = ifelse(row_number() != n(), NA, Name)) %>%
# Fill NA with the name from the last row
fill(Name, .direction = "up")
# Remove the string (full name)
dt3 <- dt2 %>% mutate(Name = gsub("\\s*\\([^\\)]+\\)", "", Name))
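A shorter sketch of the same "use the last name of each ID" idea, using dplyr::last() instead of the NA-and-fill step. It assumes, as the question states, that the last row of each ID holds the full name; dt_clean is an illustrative name, and only a subset of the rows is used.

```r
library(dplyr)

dt <- data.frame(
  ID = c(2, 2, 3, 3, 3),
  Name = c("PAT E DAV", "PAT ERIN DAV(full name)",
           "JIL WIRTH", "JIL K WIRTH", "JILL KATH WIRTH(full name)"),
  stringsAsFactors = FALSE
)

dt_clean <- dt %>%
  group_by(ID) %>%
  # Take the name on the last row of the group and strip the "(full name)" marker
  mutate(Name = sub("\\s*\\(full name\\)$", "", last(Name))) %>%
  ungroup()
```

This depends on the row order within each ID, so the data should not be re-sorted before this step.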

How to gather series of columns with data into rows [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
I'm just trying to get my head around tidying my data and I have this problem:
I have data as follows:
ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx1Date Tx1Details
1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
I want the data to be in the format
ID Tx TxDate TxDetails
1 14 12/3/14 blabla
1 1e 12/5/14 morebla
1 r 14/2/14 grrr
2 23 14/5/16 albalb
2 342 1/4/5 teeee
2 s 5/6/17 purrr
I have used
library(tidyr)
library(dplyr)
NewData<-mydata %>% gather(key, value, "ID", 2:10)
but I'm not sure how to rename the columns as per the intended output to see if this will work
You can rename the columns to conventional, separable names and then use the base reshape function. This assumes your initial data frame looks like this (I changed the last two column names to Tx3Date and Tx3Details, as otherwise they are duplicates of columns 3 and 4):
df
# ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx3Date Tx3Details
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
names(df) <- gsub("(\\d)(\\w*)", "\\2\\.\\1", names(df))
df
# ID Tx.1 TxDate.1 TxDetails.1 Tx.2 TxDate.2 TxDetails.2 Tx.3 TxDate.3 TxDetails.3
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
reshape(df, varying = 2:10, idvar = "ID", dir = "long")
# ID time Tx TxDate TxDetails
#1.1 1 1 14 12/3/14 blabla
#2.1 2 1 23 14/5/16 albalb
#1.2 1 2 1e 12/5/14 morebla
#2.2 2 2 342 1/4/5 teeee
#1.3 1 3 r 14/2/14 grrr
#2.3 2 3 s 5/6/17 purrr
Drop the redundant time variable if you don't need it.
The data.table package handles this pretty well.
library(data.table)
setDT(df)
melt(df, measure = list(Tx = grep("^Tx[0-3]$", names(df)),
Date = grep("Date", names(df)),
Details = grep("Details", names(df))),
value.name = c("Tx", "TxDate", "TxDetails"))
Or more concisely
melt(df, measure = patterns("^Tx[0-3]$", "Date", "Details"),
value.name = c("Tx", "TxDate", "TxDetails"))
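For completeness, a tidyr sketch of the same reshape using pivot_longer() with the special ".value" name. It reuses the renaming idea from the reshape answer (Tx1Date becomes TxDate.1) and assumes the corrected column names Tx3Date and Tx3Details; the index column name set is illustrative.

```r
library(tidyr)

df <- data.frame(
  ID = 1:2,
  Tx1 = c("14", "23"), Tx1Date = c("12/3/14", "14/5/16"), Tx1Details = c("blabla", "albalb"),
  Tx2 = c("1e", "342"), Tx2Date = c("12/5/14", "1/4/5"), Tx2Details = c("morebla", "teeee"),
  Tx3 = c("r", "s"), Tx3Date = c("14/2/14", "5/6/17"), Tx3Details = c("grrr", "purrr"),
  stringsAsFactors = FALSE
)

# Move the index to the end of each name: Tx1Date -> TxDate.1, Tx1 -> Tx.1
names(df) <- sub("^(Tx)(\\d)(.*)$", "\\1\\3.\\2", names(df))

# ".value" keeps the part before the dot as the output column name
long <- pivot_longer(df, cols = -ID,
                     names_to = c(".value", "set"),
                     names_sep = "\\.")
```

The rows come out grouped by ID, which matches the layout requested in the question; drop the set column if it is not needed.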
