Extract|Grep|Substring character vector in R

Only entries that start with "passport" (^passport) need to be captured.
Example:
entry = c("passport AR4133553 expires 11 mar 2019","passport 472420180","passport 563220533 (korea, north)",
"passport iraq","passport m 788439","following data derived from an eritrean passport issued",
"passport and national")
Desired output: only the passport number and the country name should be captured:
passport      passport_country
"AR4133553"   NA
"472420180"   NA
"563220533"   "korea, north"
NA            "iraq"
"788439"      NA
NA            NA
NA            NA

#sample data
entry = c("passport AR4133553 expires 11 mar 2019",
"passport 472420180",
"passport 563220533 (korea, north)",
"passport iraq",
"passport m 788439",
"following data derived from an eritrean passport issued",
"passport and national")
#fetch passport number from sample data (i.e. the digit-containing token immediately after 'passport')
passport_no <- gsub("^passport\\s((([a-zA-Z]*\\d)|(\\d[a-zA-Z]*))\\S*).*", "\\1", entry, perl=T)
ind <- grep("^passport\\s((([a-zA-Z]*\\d)|(\\d[a-zA-Z]*))\\S*).*", entry, value=F)
passport_no[-ind] <- NA
#fetch passport country from sample data
library(maptools)
data(wrld_simpl)
passport_country <- lapply(gsub("[()]", "", entry), function(x)
  as.character(wrld_simpl@data$NAME[sapply(wrld_simpl@data$NAME, grepl, x, ignore.case=TRUE)]))
passport_country <- lapply(passport_country, function(x)
if(identical(x, character(0))) NA_character_ else x)
#note that 'Korea, North' is not selected in the above comparison, as its official country name in wrld_simpl is 'Korea, Democratic People's Republic of'
#final data
df <- data.frame(cbind(passport_no, passport_country))
df
Output is:
passport_no passport_country
1 AR4133553 NA
2 472420180 NA
3 563220533 NA
4 NA Iraq
5 NA NA
6 NA Eritrea
7 NA NA
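
A side note: data.frame(cbind(passport_no, passport_country)) yields list columns, because passport_country is a list. Below is a base R sketch that returns plain character columns instead (extract_passport is a hypothetical helper of my own, not from the answer above); it also picks up the number in "passport m 788439", but it only finds countries given in parentheses, so a bare name like "iraq" would still need the wrld_simpl lookup:
# sketch: first digit-containing token after "passport", plus any parenthesised country
extract_passport <- function(x) {
  # passport number: first token containing a digit, in strings starting with "passport"
  no <- sub("^passport\\b\\D*?(\\S*\\d\\S*).*", "\\1", x, perl = TRUE)
  no[!grepl("^passport\\b.*\\d", x, perl = TRUE)] <- NA
  # country: whatever sits inside parentheses, if anything
  country <- sub(".*\\(([^)]+)\\).*", "\\1", x)
  country[!grepl("\\(", x)] <- NA
  data.frame(passport_no = no, passport_country = country, stringsAsFactors = FALSE)
}
extract_passport(entry)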

Related

Skip na_interpolation on dplyr group/variable pairs with full NAs in R

I have a data frame that looks like this:
Country Year acnt_class wages
3 AZE 2010 NA NA
4 AZE 2011 0.4206776 NA
5 AZE 2012 NA NA
6 AZE 2013 NA NA
7 AZE 2014 0.7735889 0.4273174
8 AZE 2015 NA NA
9 AZE 2016 NA NA
10 AZE 2017 0.5108674 0.4335978
11 AZE 2018 NA NA
15 BDI 2010 NA NA
16 BDI 2011 0.3140646 NA
17 BDI 2012 NA NA
18 BDI 2013 NA NA
19 BDI 2014 0.1224175 NA
20 BDI 2015 NA NA
21 BDI 2016 NA NA
22 BDI 2017 NA NA
23 BDI 2018 NA NA
27 BEL 2010 NA NA
28 BEL 2011 0.9576057 NA
29 BEL 2012 NA NA
30 BEL 2013 NA NA
31 BEL 2014 1.0083120 0.9623492
32 BEL 2015 NA NA
33 BEL 2016 NA NA
34 BEL 2017 1.0036910 0.9499486
35 BEL 2018 NA NA
I'm trying to run this function, which uses "stine" interpolation, to fill in the NAs by group across both variable columns "acnt_class" and "wages":
DF <- DF %>%
  group_by(Country) %>%
  mutate_at(.vars = c("acnt_class", "wages"),
            .funs = ~ na_interpolation(., option = "stine"))
It works when every group has at least two non-NA observations in each column; here, however, I run into this error:
Error in na_interpolation(., option = "stine") :
Input data needs at least 2 non-NA data point for applying na_interpolation
This is due to the group "BDI" being entirely NA for the variable "wages".
Ideally, I'm looking for a modified function that will "skip" group/variable pairs that are all NA (or have only one observation) and leave them as they were. Solutions? Thanks!
Found a solution, for the interpolation only:
library(imputeTS)  # na_interpolation
library(dplyr)
library(zoo)       # na.approx
DF <- DF %>%
  group_by(Country) %>%
  mutate_at(vars(acnt_class, wages),
            funs(if (sum(!is.na(.)) < 2) . else
              replace(na_interpolation(., option = "stine"),
                      is.na(na.approx(., na.rm = FALSE)), NA)))
The answer provided by TiberiusGracchus2020 works well. In case it is helpful to anyone, I have turned that code snippet into a function with a lot of comments to make it clearer what's happening at each stage.
# Modified imputeTS::na_interpolation that
# (1) doesn't break on all-NA vectors
# (2) won't impute leading and trailing NAs
na_interpolation2 <- function(x, option = "linear") {
  library(imputeTS)
  library(zoo)  # na.approx
  total_not_missing <- sum(!is.na(x))
  # check there is sufficient data for na_interpolation
  if (total_not_missing < 2) {
    x
  } else {
    # replace() takes an input vector, a TRUE/FALSE vector and a replacement value
    replace(
      # input vector is the interpolated data; this alone would also
      # impute leading/trailing NAs, which we don't want
      imputeTS::na_interpolation(x, option = option),
      # na.approx() leaves leading/trailing NAs in place, so this flags them
      is.na(na.approx(x, na.rm = FALSE)),
      # replace the flagged positions with NA
      NA)
  }
}
# example data
data1 <- c(NA, NA, NA, NA, NA)
data2 <- c(NA, NA, 1, NA, 3, NA)
na_interpolation(data1)
# Error in na_interpolation(data1) : Input data needs at
# least 2 non-NA data point for applying na_interpolation
na_interpolation(data2)
# [1] 1 1 1 2 3 3
na_interpolation2(data1)
# [1] NA NA NA NA NA
na_interpolation2(data2)
# [1] NA NA 1 2 3 NA
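
For what it's worth, on current dplyr (1.0+) the same group-wise guard can be written with across() instead of the superseded mutate_at()/funs(). A minimal sketch under that assumption (like the question's first attempt, this lets na_interpolation fill leading/trailing NAs; fold in the replace()/na.approx() trick above if those should stay NA):
library(dplyr)
library(imputeTS)

DF <- DF %>%
  group_by(Country) %>%
  mutate(across(c(acnt_class, wages),
                # skip group/variable pairs with fewer than 2 non-NA points
                ~ if (sum(!is.na(.x)) < 2) .x
                  else na_interpolation(.x, option = "stine"))) %>%
  ungroup()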

Box plot for one row of a (frequency) table

I have a data set as a .csv file (basically: people's wine choice in relation to the origin of the ambient music playing). Reading this as a dataframe results in a df looking like this:
Music Wine
1 French French
2 Italian French
3 None Italian
4 Italian Italian
5 French Other
...
As a table, it looks like this:
Wine
Music Other French Italian
French 35 39 1
None 43 30 11
Italian 35 30 19
Now I want to create a frequency diagram that ONLY plots the relative distribution of purchases made with Music == "None". So basically I'd get Other = 0.511904, French = 0.3571429, Italian = 0.1309524.
My problem is that subsetting this table isn't working.
noMusic <- prop.table(table(data[data$Music == "None"]))
geenMuziekTabel <- prop.table(table(data[data$Music == "None"]))
Both result in this:
[1] 0.144032922 0.004115226 0.045267490 0.078189300 NA NA NA NA
[9] NA NA NA NA NA NA NA NA
[17] NA NA NA NA NA NA NA NA
[25] NA NA NA NA NA NA NA NA
[33] NA NA NA NA NA NA NA NA
[41] NA NA NA NA NA NA NA NA
[49] NA NA NA NA NA NA NA NA
[57] NA NA NA NA NA NA NA NA
[65] NA NA NA NA NA NA NA NA
[73] NA NA NA NA NA NA NA NA
[81] NA NA NA NA
I thought: maybe I should subset my dataframe FIRST and THEN make a proportional table out of it, but R seems to remember that there was other data, and makes this table:
Wine
Music Other French Italian
French 0 0 0
None 43 30 11
Italian 0 0 0
I've tried a number of different things, too, but can't figure it out. Would anyone know what I'm doing wrong?
Edit: the solution, based on the accepted answer, is as follows:
noMusicTable <- prop.table(table(musicwine$Wine[musicwine$Music == "None"]))
#noMusicTable <- prop.table(table(subset(musicwine, Music == "None", select = Wine)))
noMusicDF <- as.data.frame(noMusicTable)
# need to declare x and y explicitly; use stat = 'identity' to map bars to y-variable
ggplot(noMusicDF, mapping = aes(x = Var1, y = Freq)) + geom_bar(stat = 'identity', fill='red')
Here are three ways to subset correctly:
dat <- read.table(text =
"Music Wine
French French
Italian French
None Italian
Italian Italian
French Other", header = TRUE)
# Two different ways to subset
prop.table(table(dat$Wine[dat$Music == "None"]))
prop.table(table(subset(dat, Music == "None", select = Wine)))
# With dplyr and piping
library(dplyr)
dat %>%
filter(Music == "None") %>%
select(Wine) %>%
table() %>%
prop.table()
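
If the end goal is the bar chart from the question's edit, the table() step can be skipped entirely by computing the proportions with dplyr before plotting. A sketch using the same dat as above:
library(dplyr)
library(ggplot2)

dat %>%
  filter(Music == "None") %>%
  count(Wine) %>%                # frequency of each wine
  mutate(Freq = n / sum(n)) %>%  # counts to proportions
  ggplot(aes(x = Wine, y = Freq)) +
  geom_col(fill = "red")         # geom_col maps bars directly to y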

R Filling missing values with NA for a data frame

I am currently trying to create a data-frame with the following lists
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
When I try to create a data frame I get the (obvious) error that the lengths of the lists are not equal. I want either to make the lists the same length or to fill the missing data frame entries with NA. After some searching I have not been able to find a solution.
Here are purrr (part of the tidyverse) and base R solutions, assuming you just want to fill the remaining values in each list with NA. I take the maximum length of any list as len, then for each list append rep(NA) for the difference between its length and len.
library(tidyverse)
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
all_lists <- list(location, organization, person, date, Jobs)
len <- max(lengths(all_lists))
With purrr::map_dfc, you can map over the list of lists, tack on NAs as needed, convert to character vector, then get a data frame of all those vectors cbinded in one piped call:
map_dfc(all_lists, function(l) {
  c(l, rep(NA, len - length(l))) %>%
    as.character()
})
#> # A tibble: 5 x 5
#> V1 V2 V3 V4 V5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
In base R, you can lapply the same function across the list of lists, then use Reduce to cbind the resulting lists and convert it to a data frame. Takes two steps instead of purrr's one:
cols <- lapply(all_lists, function(l) c(l, rep(NA, len - length(l))))
as.data.frame(Reduce(cbind, cols, init = NULL))
#> V1 V2 V3 V4 V5
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
For both of these, you can now set the names however you like.
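For example, reusing the base R result from above (the names come from the original lists):
df <- as.data.frame(Reduce(cbind, cols, init = NULL))
names(df) <- c("location", "organization", "person", "date", "Jobs")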
You could do:
data.frame(sapply(dyem_list, "length<-", max(lengths(dyem_list))))
location organization person date Jobs
1 USA Microsoft NULL 1989 CEO
2 Singapore University of London NULL 2001 Chairman
3 UK Boeing NULL 2018 VP of sales
4 NULL Apple NULL NULL General Manager
5 NULL NULL NULL NULL Director
Where dyem_list is the following:
dyem_list <- list(
location = list("USA","Singapore","UK"),
organization = list("Microsoft","University of London","Boeing","Apple"),
person = list(),
date = list("1989","2001","2018"),
Jobs = list("CEO","Chairman","VP of sales","General Manager","Director")
)
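Note the NULLs in this output: because the inputs are lists, each padded column is itself a list, and length<- pads lists with NULL rather than NA. If plain character columns with real NAs are preferred, the padded lists can be flattened first; a sketch:
# pad every list to the maximum length, then flatten each one to a
# character vector so the padded NULL slots become real NAs
padded <- lapply(dyem_list, "length<-", max(lengths(dyem_list)))
df <- data.frame(lapply(padded, function(col)
        sapply(col, function(x) if (is.null(x)) NA_character_ else as.character(x))),
      stringsAsFactors = FALSE)
df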

Split text string into column based on variable

I have a dataframe with a text column that I would like to split into multiple columns, since the text string contains multiple variables such as location, education, distance, etc.
Dataframe:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split this using cSplit: cSplit(df, 'text.string', sep = "&"):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1: NA location=NY distance=30 education=University NA NA
2: NA location=CA distance=30 education=Highschool education=University NA
3: NA location=MN distance=10 industry=Healthcare NA NA
4: NA location=VT distance=30 education=University industry=IT industry=Business
The problem is that the text string may contain multiples of the same variable, or miss a certain variable altogether. With cSplit, the grouping of the variables per column gets all mixed up; I would like to avoid this and keep them grouped together.
So it would look similar to this (education and industry no longer appear in multiple columns):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1 NA location=NY distance=30 education=University <NA> NA
2 NA location=CA distance=30 education=Highschool education=University <NA> NA
3 NA location=MN distance=10 <NA> industry=Healthcare NA
4 NA location=VT distance=30 education=University industry=IT industry=Business NA
Taking @NicE's comment into account, this is one way, following your example:
library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
clean <- strsplit(text.string, "&|=")
out <- lapply(clean, function(x) {
  ma <- data.table(matrix(x[!x == ""], nrow = 2, byrow = FALSE))
  setnames(ma, as.character(ma[1, ]))
  ma[-1, ]
})
out <- rbindlist(out, fill = T)
out
location distance education education industry industry
1: NY 30 University NA NA NA
2: CA 30 Highschool University NA NA
3: MN 10 NA NA Healthcare NA
4: VT 30 University NA IT Business
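
A tidyverse sketch of the same idea (my own translation, not from the original answer; assumes tidyr 1.0+ for pivot_wider()): split each string into key=value pairs, number repeated keys within a row so values of the same variable stay grouped, then widen:
library(dplyr)
library(tidyr)

tibble(id = seq_along(text.string), text.string) %>%
  separate_rows(text.string, sep = "&") %>%
  filter(text.string != "") %>%                     # drop the empty leading piece
  separate(text.string, into = c("key", "value"), sep = "=") %>%
  group_by(id, key) %>%
  mutate(key = paste0(key, "_", row_number())) %>%  # education_1, education_2, ...
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value)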

Use zoo to read and split a data frame over a column

I have a table containing observations on restaurant scores (restaurants are identified by ID). The variable mean is the mean rating of reviews received in a week-long window centered on each day (i.e. from 3 days before until 3 days after), and the variable count is the number of reviews received in the same window (see the code below for a dput of a randomly-generated sample of my data frame).
I am interested in looking at those restaurants that show big spikes in either variable (like their mean rating suddenly going up by a lot, or dropping sharply). For those restaurants, I would like to investigate what's going on by plotting the distribution (I have lots of restaurants, so I can't do it manually and have to restrict my domain for semi-manual inspection).
Also, since my data is day-by-day, I would like it to be less granular. In particular, I want to average all the ratings or counts for a given month into a single value.
I think zoo should help me do this nicely: given the data frame in the example, I think I can convert it to a zoo time series that is aggregated and split the way I want by using:
z <- read.zoo(df, split = "restaurantID", format = "%m/%d/%Y",
              index.column = 2, FUN = as.yearmon, aggregate = mean)
however, splitting on restaurantID does not yield the expected result. What I get instead is lots of NAs:
mean.1006054 count.1006054 mean.1006639 count.1006639 mean.1006704 count.1006704 mean.1007177 count.1007177
Lug 2004 NA NA NA NA NA NA NA NA
Ago 2004 NA NA NA NA NA NA NA NA
Nov 2004 NA NA NA NA NA NA NA NA
Gen 2005 NA NA NA NA NA NA NA NA
Feb 2005 NA NA NA NA NA NA NA NA
Mar 2005 NA NA NA NA NA NA NA NA
mean.1007296 count.1007296 mean.1007606 count.1007606 mean.1007850 count.1007850 mean.1008272 count.1008272
Lug 2004 NA NA NA NA NA NA NA NA
Ago 2004 NA NA NA NA NA NA NA NA
Nov 2004 NA NA NA NA NA NA NA NA
Gen 2005 NA NA NA NA NA NA NA NA
Feb 2005 NA NA NA NA NA NA NA NA
Mar 2005 NA NA NA NA NA NA NA NA
Note that it works if I don't split it on the restaurantID column.
df$website <- NULL
> z <- read.zoo(df, format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
> head(z)
restaurantID mean count
Lug 2004 1418680 3.500000 1
Ago 2004 1370457 5.000000 1
Nov 2004 1324645 4.333333 1
Gen 2005 1425933 1.920000 1
Feb 2005 1315289 3.000000 1
Mar 2005 1400577 2.687500 1
Also, plot.zoo(z) works but of course the produced graph has no meaning for me.
My questions are:
1) How can I filter the restaurants that have the largest month-to-month spikes in either column?
2) How can I split on restaurantID and plot the time series of only such restaurants?
DATA HERE (wouldn't fit SO's word limit)
Try:
library(plyr)      # ddply
library(reshape2)  # melt
library(ggplot2)

# helper function: change from each observation to the next (0 for the first)
difflist <- function(v) c(0, diff(v))
# make center a Date
df$center <- as.Date(df$center, format = "%m/%d/%Y")
# sort the data frame in time order within restaurant
df <- df[order(df$restaurantID, df$center), ]
# now calculate the change in each column
deltas <- ddply(df, .(restaurantID), function(x)
  cbind(center = x$center,
        delta_mean = difflist(x$mean),
        delta_count = difflist(x$count)))
# filter out only the big spikes
deltas_big <- subset(deltas, delta_mean > 2 | delta_count > 3)
# arrange the data
delta_melt <- melt(deltas_big,id.vars=c('restaurantID','center'))
# now plot by time
ggplot(delta_melt, aes(x=center,y=value,color=variable)) + geom_point()
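
For reference, the same pipeline can be sketched with dplyr in place of plyr/reshape2 (my own translation, not part of the answer; thresholds as above, lag() instead of the difflist helper):
library(dplyr)
library(tidyr)
library(ggplot2)

df %>%
  mutate(center = as.Date(center, format = "%m/%d/%Y")) %>%
  arrange(restaurantID, center) %>%
  group_by(restaurantID) %>%
  mutate(delta_mean  = mean - lag(mean),        # change since previous observation
         delta_count = count - lag(count)) %>%
  ungroup() %>%
  filter(delta_mean > 2 | delta_count > 3) %>%  # keep only the big spikes
  pivot_longer(c(delta_mean, delta_count),
               names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = center, y = value, color = variable)) +
  geom_point()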
The robfilter R package was developed to filter time series data and pick out outliers, using robust-statistics methods for time series analysis. You can use its adore.filter function to fit a signal to the data and then flag the observations that deviate far from it.
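A minimal usage sketch (the toy series is my own; per the package documentation the fitted signal is returned in the level component, and ?adore.filter lists the tuning parameters):
library(robfilter)

# toy series: smooth signal plus noise, with two injected spikes
y <- sin(seq(0, 4 * pi, length.out = 200)) + rnorm(200, sd = 0.1)
y[c(50, 120)] <- y[c(50, 120)] + 3

fit <- adore.filter(y)
plot(fit)                                # extracted signal overlaid on the data
spikes <- which(abs(y - fit$level) > 1)  # observations far from the signal
spikes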
