Splitting column with differing syntax in R - r

I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.
As seen below the address is in one string.
head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road no., followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.

1) Use separate from tidyr to split Address into 3 fields, merging anything left over into the last one, and then use separate again to split the last 4 characters off the Number column that the first separate generated.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Update
Added and fixed (2).

Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))

One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
The stringr package is convenient for this kind of regex work because functions such as str_match return every capture group at once, which in this example could allow you to separate each component of the address with one expression.
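The idea of capturing every address component with one expression can be sketched as follows. This is only a sketch using strcapture from base R's utils package (stringr's str_match would work along the same lines); the column names and the pattern are my own assumptions based on the layout described in the question (road, street number, 4-digit postal code, city).

```r
# Sample addresses taken from the question and the answers above
addresses <- c("Petersvej 772900 Hoersholm",
               "Annasvej 121B2900 Hoersholm",
               "Kr\u00e6nsvej 125800 Lyngby C",   # \u00e6 is the letter ae-ligature
               "A Mivs Alle 119800 Hjoerring")

# Prototype describing the name and type of each captured group (names are assumptions)
proto <- data.frame(Road = character(), StreetNo = character(),
                    Postal = character(), City = character(),
                    stringsAsFactors = FALSE)

# Groups: road name, street number, 4-digit postal code, city (may contain spaces)
parts <- strcapture("^(.+) (\\S+)(\\d{4}) (.+)$", addresses, proto, perl = TRUE)
parts
```

The postal code is kept as character here so that it is not silently turned into a number; rows that do not match the pattern come back as NA.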

Related

Using R Base to sum a column of a dataframe for each value of a list

I have a dataframe named 2022_Rev that looks sort of like this:
Name Vendor Sales
Steve 6 80,000
Annie 4 95,000
Bill 6 45,000
Steve 3 25,000
Bill 2 40,000
Sam 5 5,000
... ... ...
I also have a list of each sales person:
Employees ['Steve', 'Annie', 'Bill', 'Sam', ...]
I want to apply mean() to column sales for each item in the list "Employee". I am supposed to use base R to create a loop that goes through each value in "Employees" and then creates a vector showing the mean for each employee. So far I have:
avgSales = rep(NA, 10)
for (i in length(Employees)){
if(Employees[i] == 2022_Rev$Name){
avgSales[i] = mean(2022_Rev$Sales[i])
}
}
This is erroring apparently because if can only check one value? I'm not sure how to fix it.
This is not normally the approach we would take in R (i.e. there are better ways to get the mean of a column by group). However, if you want an example of a for loop over the names of the Employees in your list, here is one base R approach. First preallocate a named vector as long as your Employees, and then fill it using a for loop:
sales_means = setNames(vector("numeric", length = length(Employees)), Employees)
for(e in Employees) {
sales_means[e] = mean(`2022_Rev`[`2022_Rev`$Name == e, "Sales"], na.rm = TRUE)
}
Output:
Steve Annie Bill Sam
52500 95000 42500 5000
Input:
`2022_Rev` = structure(list(Name = c("Steve", "Annie", "Bill", "Steve", "Bill",
"Sam"), Vendor = c(6L, 4L, 6L, 3L, 2L, 5L), Sales = c(80000L,
95000L, 45000L, 25000L, 40000L, 5000L)), row.names = c(NA, -6L
), class = "data.frame")
Employees = list('Steve', 'Annie', 'Bill', 'Sam')
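For comparison, here is a sketch of the loop from the question with its two problems addressed: for (i in length(Employees)) only visits the last index, and if() cannot take a whole vector of comparisons. The name rev_2022 is a hypothetical syntactic stand-in for 2022_Rev, and Employees is assumed to be a plain character vector.

```r
# rev_2022 is a hypothetical stand-in for `2022_Rev` (same sample values as above)
rev_2022 <- data.frame(
  Name  = c("Steve", "Annie", "Bill", "Steve", "Bill", "Sam"),
  Sales = c(80000, 95000, 45000, 25000, 40000, 5000),
  stringsAsFactors = FALSE
)
Employees <- c("Steve", "Annie", "Bill", "Sam")

avgSales <- rep(NA_real_, length(Employees))
for (i in seq_along(Employees)) {            # seq_along() visits every index
  is_employee <- rev_2022$Name == Employees[i]   # logical vector, one entry per row
  avgSales[i] <- mean(rev_2022$Sales[is_employee])
}
names(avgSales) <- Employees
avgSales
```

The comparison is used to subset Sales rather than being passed to if(), so no condition of length greater than one is ever tested.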
We could use the subset option in aggregate from base R
aggregate(Sales ~ Name, data = `2022_Rev`, subset = Name %in% Employees, mean)
Name Sales
1 Annie 95000
2 Bill 42500
3 Sam 5000
4 Steve 52500
We can use aggregate to calculate the mean of Sales with respect to Name , then transform your list Employees to data.frame then merge it with the aggregate result to get the values in the list
aggregate(Sales ~ Name , `2022_Rev` , mean) |>
merge(do.call(rbind , Employees) |>
data.frame(Name = _) , by.y = "Name")
Output
Name Sales
1 Annie 95000
2 Bill 42500
3 Sam 5000
4 Steve 52500

Removing all characters before and after text in R, then creating columns from the new text

So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place Number Name
PHI 80 J.Matthews
NE 5 J.Mills
KC 10 T. Hill
Edit for further explanation:
Most strings include other people in the same format, so "downed by" needs to be incorporated in some way to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place Number Name
KC 10 T. Hill
This solution simply extracts the components based on the logic OP mentioned, i.e. it captures the characters that are needed as three groups: 1) one or more upper case letters ([A-Z]+) followed by a dash (-), 2) then one or more digits (\\d+), and finally 3) the non-whitespace characters (\\S+) that follow the second dash
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-output
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
Here is an alternative. First, sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1" removes everything after the name. Then separate splits the string at " by ", and a second separate splits the kept part into the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
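For strings that mention several people, anchoring the pattern on "grabbed by" keeps the match on the right name. A base R sketch with regexec and regmatches, assuming the name always has the form initial, period, surname and that every string contains a "grabbed by" entry:

```r
col1 <- c("blah, grabbed by PHI-80-J.Matthews.",
          "blah, grabbed by NE-5-J.Mills.",
          "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
          "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr")

# Three groups after the literal "grabbed by ": place, number, name
pattern <- "grabbed by ([A-Z]+)-([0-9]+)-([A-Za-z]+\\.[A-Za-z]+)"
hits <- regmatches(col1, regexec(pattern, col1))

# Each element holds the full match followed by the 3 groups; stack them row-wise.
# (This assumes every string matched; non-matching strings would need handling first.)
parts <- do.call(rbind, hits)
res <- data.frame(Place  = parts[, 2],
                  Number = as.integer(parts[, 3]),
                  Name   = parts[, 4],
                  stringsAsFactors = FALSE)
res
```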

Use extract and/or separate to isolate variable string from dataframe

I've looked through the following pages on using regex to isolate a string:
Regular expression to extract text between square brackets
What is a non-capturing group? What does (?:) do?
Split data frame string column into multiple columns
I have a dataframe which contains protein/gene identifiers, and in some cases there are two or more of these strings (separated by a comma) because of multiple matches from a list. In this case the first string is the strongest match and I'm not necessarily interested in keeping the rest. They represent multiple matches from inferred evidence, and when they cannot be easily discriminated all of the hits get put into a column. I'm only interested in keeping the first because the group will likely have the same type of annotation (i.e. type of protein, gene ontology, similar function etc.). If I split the multiple entries into more rows then it would appear that I have evidence that they exist in my dataset, but at the empirical level I don't.
My dataframe:
protein
1 sp|P50213|IDH3A_HUMAN
2 sp|Q9BZ95|NSD3_HUMAN
3 sp|Q92616|GCN1_HUMAN
4 sp|Q9NSY1|BMP2K_HUMAN
5 sp|O75643|U520_HUMAN
6 sp|O15357|SHIP2_HUMAN
523 sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|
524 sp|Q96KB5|TOPK_HUMAN
525 sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN
526 sp|O00299|CLIC1_HUMAN
527 sp|P25940|CO5A3_HUMAN
The output I am trying to create:
uniprot gene
P50213 IDH3A
Q9BZ95 NSD3
Q92616 GCN1
P12277 KCRB
I'm trying to use extract and separate functions to do this:
extract(df, protein, into = c("uniprot", "gene"), regex = c("sp|(.*?)|","
(.*?)_"), remove = FALSE)
results in:
Error: is_string(regex) is not TRUE
trying separate to at least break apart the two in multiple steps:
separate(df, protein, into = c("uniprot", "gene"), sep = "|", remove =
FALSE)
results in:
Warning message:
Expected 2 pieces. Additional pieces discarded in 528 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
protein uniprot gene
1 sp|P50213|IDH3A_HUMAN s
2 sp|Q9BZ95|NSD3_HUMAN s
3 sp|Q92616|GCN1_HUMAN s
4 sp|Q9NSY1|BMP2K_HUMAN s
5 sp|O75643|U520_HUMAN s
6 sp|O15357|SHIP2_HUMAN s
What is the best way to use regex in this scenario and are extract or separate the best way to go about this? Any suggestion would be greatly appreciated. Thanks!
Update based on feedback:
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
df1 <- separate(df, protein, into = "protein", sep = ",")
#i'm only interested in the first match, because science
df2 <- extract(df1, protein, into = c("uniprot", "gene"),
regex = "sp\\|([^|]+)\\|([^_]+)", remove = FALSE)
#create new columns with uniprot code and gene id, no _HUMAN
#df2
# protein uniprot gene
#1 sp|P50213|IDH3A_HUMAN P50213 IDH3A
#2 sp|Q9BZ95|NSD3_HUMAN Q9BZ95 NSD3
#3 sp|Q92616|GCN1_HUMAN Q92616 GCN1
#4 sp|Q9NSY1|BMP2K_HUMAN Q9NSY1 BMP2K
#5 sp|O75643|U520_HUMAN O75643 U520
#6 sp|O15357|SHIP2_HUMAN O15357 SHIP2
#523 sp|P10599|THIO_HUMAN P10599 THIO
#524 sp|Q96KB5|TOPK_HUMAN Q96KB5 TOPK
#525 sp|P12277|KCRB_HUMAN P12277 KCRB
#526 sp|O00299|CLIC1_HUMAN O00299 CLIC1
#and the answer using %>% pipes (this is what I aspire to)
df_filtered <- df %>%
separate(protein, into = "protein", sep = ",") %>%
extract(protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)") %>%
select(uniprot, gene)
#df_filtered
# uniprot gene
#1 P50213 IDH3A
#2 Q9BZ95 NSD3
#3 Q92616 GCN1
#4 Q9NSY1 BMP2K
#5 O75643 U520
#6 O15357 SHIP2
#523 P10599 THIO
#524 Q96KB5 TOPK
#525 P12277 KCRB
#526 O00299 CLIC1
We can capture the pattern as a group ((...)) in extract. Here, we match sp at the beginning (^) of the string followed by a | (metacharacter - escaped \\), followed by one or more characters not a | captured as a group, followed by a | and the second set of characters captured
library(tidyverse)
extract(df, protein, into = c("uniprot", "gene"),
regex = "^sp\\|([^|]+)\\|([^|]+).*")
If there are multiple instances of 'sp', then separate the rows into long format with separate_rows and then use extract
df %>%
separate_rows(protein, sep=",") %>%
extract(protein, into = c("uniprot", "gene"),
"^sp\\|([^|]+)\\|([^|]*).*")
There is one instance where there are only two sets of words. To make it work in that case as well:
df %>%
separate_rows(protein, sep=",") %>%
extract(protein, into = "gene", "([^|]*HUMAN)", remove = FALSE) %>%
mutate(uniprot = str_extract(protein, "(?<=sp\\|)[^_]+(?=\\|)")) %>%
select(uniprot, gene)
# uniprot gene
#1 P50213 IDH3A_HUMAN
#2 Q9BZ95 NSD3_HUMAN
#3 Q92616 GCN1_HUMAN
#4 Q9NSY1 BMP2K_HUMAN
#5 O75643 U520_HUMAN
#6 O15357 SHIP2_HUMAN
#7 P10599 THIO_HUMAN
#8 <NA> THIO_HUMAN
#9 Q96KB5 TOPK_HUMAN
#10 P12277 KCRB_HUMAN
#11 P17540 KCRS_HUMAN
#12 P12532 KCRU_HUMAN
#13 O00299 CLIC1_HUMAN
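If only the first hit of each entry is wanted, as stated in the question, the same two steps (drop everything after the first comma, then capture the two identifiers) can also be sketched in base R. This is just an illustration on three of the sample values; it uses sub and utils::strcapture rather than tidyr.

```r
protein <- c("sp|P50213|IDH3A_HUMAN",
             "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
             "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN")

# Keep only the first comma-separated hit of each entry
first_hit <- sub(",.*$", "", protein)

# Capture the accession between the pipes and the gene name before the underscore
proto <- data.frame(uniprot = character(), gene = character(),
                    stringsAsFactors = FALSE)
ids <- strcapture("^sp\\|([^|]+)\\|([^_]+)_", first_hit, proto)
ids
```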
data
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))

Comparing pairs of rows in a list of data frames

I have a list that's 1314 elements long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame maintains its shape. (I'm guessing solving the null part involves if-else, but I need to figure out the right syntax.)
Thanks.
Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
Or another option would be to convert the list to data.table by using rbindlist and then do the conversion
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))
You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1
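As for why the original attempt returned a 1 or NULL: the anonymous function returns the value of the if expression, not the modified data frame. A sketch of the same if/else logic that returns the whole data frame as its last expression (mark_winner is a hypothetical helper name, and ties are left at 0 by assumption):

```r
all_games <- list(
  data.frame(Game.ID = "201210300CLE", Team = c("CLE", "WAS"),
             Points = c(94, 84), Victory = c(0, 0), stringsAsFactors = FALSE),
  data.frame(Game.ID = "201210300CME", Team = c("CLE", "WAS"),
             Points = c(90, 92), Victory = c(0, 0), stringsAsFactors = FALSE)
)

# mark_winner is a hypothetical helper: set Victory to 1 on the row with more points
mark_winner <- function(game) {
  if (game$Points[1] > game$Points[2]) {
    game$Victory[1] <- 1
  } else if (game$Points[2] > game$Points[1]) {
    game$Victory[2] <- 1
  }
  game   # return the full data frame, not the value of the assignment
}

test_vic <- lapply(all_games, mark_winner)
test_vic
```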

split dataset by day and save it as data frame

I have a dataset with 2 months of data (month of Feb and March). Can I know how can I split the data into 59 subsets of data by day and save it as data frame (28 days for Feb and 31 days for Mar)? Preferably to save the data frame in different name according to the date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by year & month of the 'created' field. So here are two ways to do that; the first is simpler if you don't plan on needing any more date arithmetic.
# 1. Leave 'created' a string, just use text substitution to extract its month&date components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string
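# note: %m is the month (%M would be minutes); use '%d/%m/%Y' instead if the values are day/month/year
df$created <- as.Date(df$created, '%m/%d/%Y')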
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014
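If one data frame per day is what is wanted, a base R sketch could look like this. It assumes the 'created' values are day/month/year (the question says the data cover February and March), and it keeps the pieces in a named list rather than creating 59 separate objects; the column name 'day' is my own.

```r
# Small sample with the same layout as the question (four rows only)
df <- data.frame(
  text = c("cant do it with cards either", "gotta try PNNL",
           "Coming back home AK", "Me and my Tart would love to flyLoveisintheAir"),
  created = c("1/2/2014", "1/2/2014", "2/2/2014", "5/2/2014"),
  stringsAsFactors = FALSE
)

# Assumption: 'created' is day/month/year, so 1/2/2014 is 1 February 2014
df$day <- format(as.Date(df$created, format = "%d/%m/%Y"), "%Y%m%d")

# One data frame per day, named 20140201, 20140202, ...
by_day <- split(df, df$day)
names(by_day)
by_day[["20140201"]]
```

Individual days can then be pulled out by name, e.g. by_day[["20140205"]], which avoids cluttering the workspace with dozens of variables.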

Resources