Extracting numeric portion from a character in the Data Frame - r

I am trying to extract the numeric portion of this string from DF$Numbers like changing W12K32 to 1232
Current DF
Name Numbers
1 Alex W12K32
2 Tom S12WE23
3 Eric T1243
Desired Output
Name Numbers
1 Alex 1232
2 Tom 1223
3 Eric 1243

Use sub and strip off all non numeric characters:
df$Numbers <- gsub("\\D+", "", df$Numbers)
df
Name Numbers
1 Alex 1232
2 Tom 1223
3 Eric 1243
Data:
df <- data.frame(Name=c("Alex", "Tom", "Eric"), Numbers=c("W12K32", "S12WE23", "T1243"),
stringsAsFactors=FALSE)

Related

Move information to new column if the first value of the cell is a four-digit number

I have a column with addresses. The data is not clean and the information includes street and house number or sometimes postcode and city. I would like to move the postcode and city information to another column with R, while street and house number stay in the old place. The postcode is a 4 digit number string. I am grateful for any suggestion for a solution.
An ifelse with grepl should help -
library(dplyr)
df <- df %>%
mutate(Strasse = ifelse(grepl('^\\d{4}', Halter), '', Halter),
Ort = ifelse(Strasse == '', Halter, ''))
# Line Halter Strasse Ort
#1 1 1007 Abc 1007 Abc
#2 2 1012 Long words 1012 Long words
#3 3 Enelbach 54 Enelbach 54
#4 4 Abcd 56 Abcd 56
#5 5 Engasse 21 Engasse 21
grepl('^\\d{4}', Halter) returns TRUE if it finds a 4-digit number at the start of the string else returns FALSE.
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
In addition to the neat solution of #Ronak Shah, if you want to use base R
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
df$Strasse <- with(df, ifelse(grepl('^\\d{4}', Halter), '', Halter))
df$Ort <- with(df, ifelse(Strasse == '', Halter, ''))
> head(df)
Line Halter Strasse Ort
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 Enelbach 54
4 4 Abcd 56 Abcd 56
5 5 Engasse 21 Engasse 21
An option is also with separate
library(dplyr)
library(tidyr)
df %>%
separate(Halter, into = c("Strasse", "Ort"), sep = "(?<=[0-9])$|^(?=[0-9]{4} )")
Line Strasse Ort
1 1 1007 Abc
2 2 1012 Long words
3 3 Enelbach 54
4 4 Abcd 56
5 5 Engasse 21
data
df <- structure(list(Line = 1:5, Halter = c("1007 Abc", "1012 Long words",
"Enelbach 54", "Abcd 56", "Engasse 21")), class = "data.frame", row.names = c(NA,
-5L))
Suisse postal codes are made up of 4 digits:
library(dplyr)
library(stringr)
df %>%
mutate(Strasse = str_extract(Halter, '\\d{4}\\s.+'))
Line Halter Strasse
1 1 1007 Abc 1007 Abc
2 2 1012 Long words 1012 Long words
3 3 Enelbach 54 <NA>
4 4 Abcd 56 <NA>
5 5 Engasse 21 <NA>

How to FILL DOWN (autofill) value , eg replace NA with first value in group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!
Yes the best way to do this is to use #Rui Barradas idea of the zoo package. You can simply do it in one line of code with the na.locf function.
library(zoo)
DT[, V1:=na.locf(V1)]
Replace the V1 with whatever you name your column after reading in the data with fread. Good luck!
For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")

Use R to reduce a dataset to only rows that have matching data in two separate columns [duplicate]

This question already has answers here:
Display duplicate records in data.frame and omit single ones
(4 answers)
Closed 5 years ago.
I am trying to reduce a massive data set to only the rows that match in the "FIRSTNAME" column and also the "SSN" column. In order to help clarify I made a small set.
Very small sample
CUSTNUM FIRSTNAME SSN
1234 Matt 111
4321 Mark 222
5678 Mike 333
9875 Matt 444
1092 Matt 111
I want it to return
CUSTNUM FIRSTNAME SSN
1234 Matt 111
1092 Matt 111
Because they match in both columns.
My dataset has over 2 million rows of customer data in it so I need a way to id possible duplicate records.
There's a handy function for this in base r:
# suppose you have a dataframe, df, and you want to know which rows
# have duplicate values in both the FIRSTNAME and SSN columns together:
df$dup <- duplicated(df[,c('name','SSN')],fromLast=FALSE)
df$dup <- ifelse(duplicated(df[,c('name','SSN')],fromLast=TRUE),yes=TRUE,no=df$dup)
# return dups
df.answer <- df[which(df$dup),]
Alternatively, with dplyr:
library(tidyverse)
df %>%
group_by(FIRSTNAME, SSN) %>%
filter(n() > 1)
# A tibble: 2 x 3
# Groups: FIRSTNAME, SSN [1]
CUSTNUM FIRSTNAME SSN
<int> <fctr> <int>
1 1234 Matt 111
2 1092 Matt 111
# Sample data
df <- read.table(text =
"CUSTNUM FIRSTNAME SSN
1234 Matt 111
4321 Mark 222
5678 Mike 333
9875 Matt 444
1092 Matt 111", sep = "", header = T);
subsetting based on the pasted columns:
subset(df, duplicated(paste(FIRSTNAME, SSN)) | duplicated(paste(FIRSTNAME, SSN), fromLast = T))
#1 1234 Matt 111
#5 1092 Matt 111

Add column to R dataframe that is length of string in another column

This should be EASY but I can't figure it out and search didn't help. I'd like to add a column to a dataframe that is just the length of the strings in another column.
So say I have a data frame of names like such:
Name Last
1 John Doe
2 Edgar Poe
3 Walt Whitman
4 Jane Austen
I'd like to append a new column with the string length of, say, the last name, so it would look like:
Name Last Length
1 John Doe 3
2 Edgar Poe 3
3 Walt Whitman 7
4 Jane Austen 6
Thanks
We can use str_count from stringr
library(stringr)
df1$Length <- str_count(df1$Last)
df1$Length
[1] 3 3 7 6
If you want to filter by the length, based on column, then do the following:
library(dplyr)
df<- df %>%
filter(nchar(Last) <= 3)

How to rbind when only some of the columns match

I have about 18 dataframes which are essentially frequency counts of the elements stored in the column Rptnames. They all have some different and some the same elements in the Rptnames columns so they look like this
dataframe called GroupedTableProportiondelAll
Rptname freq
bob 4324234
jane 433
ham 4324
tim 22
dataframe called GroupedTableProportiondelLUAD
Rptname freq
bob 987
jane 223
jonny 12
jim 98092
I am trying to set up a table so that the Rptname becomes the column and each row is the frequencies. This is so that I can combine all the dataframes.
I have tried the following
GroupedTableProportiondelAll_T <- as.data.frame(t(GroupedTableProportiondelAll))
GroupedTableProportiondelLUAD_T <- as.data.frame(t(GroupedTableProportiondelLUAD))
total <- rbind(GroupedTableProportiondelLUAD_T, GroupedTableProportiondelAll_T)
but I get the error
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
So the question is
a) how can I do rbind (cbind would also do without transposing I suppose) so that the bind can happen without needing to match.
b) would merge be better here
c) in either is there a way to enter zero for empty values
d) P'raps there's a better way to do this like matrices which Im not really familiar with? I know its 4 questions but the central question's the same- how to bind when not all the rows or columns are matching
An alternative to the rbind + dcast technique that would use the tidyverse.
Use pipes (%>%) to first use bind_rows() to bind all your dataframes together while simultaneously creating a dataframe id column (in this case I just called the variable "df"). Then use spread() to move unique "Rptname" values to become column names and spreading the values of "freq" across the new columns. "Rptname" is the key and "freq" is the value in this case.
It would look like this:
Input:
GTP_A
Rptname freq
1 bob 4324234
2 jane 433
3 ham 4324
4 tim 22
GTP_LUAD
Rptname freq
1 bob 987
2 jane 223
3 jonny 12
4 jim 98092
Code:
GroupTable <- bind_rows(GTP_A,GTP_LUAD, .id = "df") %>%
spread(Rptname, freq)
Output:
GroupTable
df bob ham jane jim jonny tim
1 1 4324234 4324 433 NA NA 22
2 2 987 NA 223 98092 12 NA
UPDATE:
As of the release of tidyr 1.0.0 on 2019/09/13 spread() and gather() have been retired and replaced by pivot_wider() and pivot_longer(), respectively. From the release notes Hadley Wickem states "spread() and gather() won’t go away, but they’ve been retired which means that they’re no longer under active development."
In order to get the same output as above, you will now need to first arrange() by Rptname then use pivot_wider(). If you do not arrange first you will get a similar output but the column order will not be the same as the output from spread().
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
arrange(Rptname) %>%
pivot_wider(names_from = Rptname, values_from = freq)
You could first rbind the dataframes after adding a column to identify the data.frame. Then use dcast function from reshape2 package.
rpt1
## Rptname freq df
## 1 bob 4324234 rpt1
## 2 jane 433 rpt1
## 3 ham 4324 rpt1
## 4 tim 22 rpt1
rpt2
## Rptname freq df
## 1 bob 987 rpt2
## 2 jane 223 rpt2
## 3 jonny 12 rpt2
## 4 jim 98092 rpt2
rpt1$df <- "rpt1"
rpt2$df <- "rpt2"
rpt <- rbind(rpt1, rpt2)
dcast(data = rpt, df ~ Rptname, value.var = "freq")
## df bob ham jane tim jim jonny
## 1 rpt1 4324234 4324 433 22 NA NA
## 2 rpt2 987 NA 223 NA 98092 12

Resources