cleaning data - expanding one column to multiple columns in a dataframe - r

The textsample below is in one column. Using R, I hope to separate it into 5 columns with the following headings: "Name" , "Location", "Date", "Time", "Warning" . I have tried separate() and strsplit() and haven't succeeded yet. I hope someone here can help.
textsample <- "Name : York-APC-UPS\r\n
Location : York SCATS Zigzag Road\r\n
Contact : Mechanical services\r\n
\r\n
http://York-APC-UPS.domain25.minortracks.wa.gov.au\r\n
http://192.168.70.56\r\n
http://FE81::3C0:B8FF:FE6D:8065\r\n
Serial Number : 5A1149T24253\r\n
Device Serial Number : 5A1149T24253\r\n
Date : 12/06/2018\r\n
Time : 08:45:46\r\n
Code : 0x0125\r\n
\r\n
Warning : A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor
(Port 1 Temp 1 at Port 1) reporting over 50%CD.\r\n"

Here's an approach that should at least get you started:
We can use extract from tidyr to pull out the text of interest with regular expressions.
Then we can use mutate_all to apply the same str_replace to every column to strip the labels.
library(dplyr)
library(tidyr)
library(stringr)
as.data.frame(textsample) %>%
extract(1, into=c("Name","Location","Date","Time","Warning"),
regex = "(Name : .+)[^$]*(Location : .+)[^$]*(Date : .+)[^$]*(Time : .+)[^$]*(Warning : .+)[^$]*") %>%
mutate_all(list(~str_replace(.,"^\\w+ : ","")))
# Name Location Date Time
#1 York-APC-UPS York SCATS Zigzag Road 12/06/2018 08:45:46
# Warning
#1 A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor
This relies on capturing groups with (); see help(tidyr::extract) for details. Between the groups we use [^$]* to match any character other than a literal $ zero or more times (inside a character class, $ loses its end-of-string meaning), which effectively skips over the intervening lines.
Note the first argument to extract is 1, which indicates the first (and only) column of the data.frame I made from your example data. Change this as necessary.
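If you prefer base R, a similar result can be had by splitting on the line breaks and treating each "Label : value" line as a key-value pair. This is only a sketch on a trimmed-down version of the sample text, and it assumes single-word labels (multi-word labels like "Serial Number" would need a different pattern than ^\w+):

```r
# Shortened sample with only the single-word-label lines
textsample <- paste0("Name : York-APC-UPS\r\n",
                     "Location : York SCATS Zigzag Road\r\n",
                     "Date : 12/06/2018\r\n",
                     "Time : 08:45:46\r\n",
                     "Warning : A high humidity threshold violation exists")
lines <- trimws(strsplit(textsample, "\r\n")[[1]])
vals  <- sub("^\\w+ : ", "", lines)      # drop the labels, keep the values
names(vals) <- sub(" : .*$", "", lines)  # keep the labels as names
keep <- c("Name", "Location", "Date", "Time", "Warning")
result <- as.data.frame(as.list(vals[keep]), stringsAsFactors = FALSE)
result$Name
#> [1] "York-APC-UPS"
```

This sidesteps the long regex entirely at the cost of assuming one field per line.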

Related

Is there a way to map or match people's names to religions in R?

I'm working on a paper on electoral politics and tried using this dataset to calculate the share of the electorate that each religion makes up. I created an if() statement and a Christian variable and tried to increase the number of Christians by one whenever a Christian name pops up, but was unable to do so. I would appreciate it if you could help me with this.
library(dplyr)
library(ggplot2)
Christian=0
if(Sample...Sheet1$V2=="James"){
Christian=Christian+1
}
PS: the output is
Warning message:
In if (Sample...Sheet1$V2 == "James") { :
the condition has length > 1 and only the first element will be used
Notwithstanding my comment about the fundamental non-validity of this approach, here’s how you would solve this general problem in R:
Generate a lookup table of the different names and categories — this table is independent of your input data:
religion_lookup = tribble(
~ Name, ~ Religion,
'James', 'Christian',
'Christopher', 'Christian',
'Ahmet', 'Muslim',
'Mohammed', 'Muslim',
'Miriam', 'Jewish',
'Tarjinder', 'Sikh'
)
Match your input data against the lookup table (I'm using an input table called data with a column Name instead of your Sample...Sheet1$V2):
matched = match(data$Name, religion_lookup$Name)
religion = religion_lookup$Religion[matched]
Count the results:
table(religion)
religion
Christian Jewish Muslim Sikh
2 5 3 1
Note the lack of ifs and loops in the above.
Christian <- sum( Sample...Sheet1$V2=="James" )
There you go; no if block needed.
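Putting the lookup idea together as a self-contained sketch in base R (the voters frame below is made-up data standing in for your Sample...Sheet1):

```r
# Lookup table: independent of the input data
religion_lookup <- data.frame(
  Name     = c("James", "Christopher", "Ahmet", "Mohammed", "Miriam", "Tarjinder"),
  Religion = c("Christian", "Christian", "Muslim", "Muslim", "Jewish", "Sikh"),
  stringsAsFactors = FALSE
)
# Made-up input data standing in for Sample...Sheet1$V2
voters <- data.frame(Name = c("James", "Ahmet", "Miriam", "James", "Mohammed"),
                     stringsAsFactors = FALSE)
# Vectorised match() instead of if()/loops
religion <- religion_lookup$Religion[match(voters$Name, religion_lookup$Name)]
table(religion)  # Christian 2, Jewish 1, Muslim 2
```

Names absent from the lookup table come back as NA, which table() silently drops; use table(religion, useNA = "ifany") if you want them counted.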

how to use regexpr to identify patterns in ICD-10 data

I am working with ICD-10 data, and I wish to create a new variable called complication based on the pattern "E1X.9X", using regular expressions, but I keep getting an error. Please help.
dm_2$icd9_9code<- (E10.49, E11.51, E13.52, E13.9, E10.9, E11.21, E16.0)
dm_2$DM.complications<- "present"
dm_2$DM.complications[regexpr("^E\\d{2}.9$",dm_2$icd9_code)]<- "None"
# Error in dm_2$DM.complications[regexpr("^E\\d{2}.9", dm_2$icd9_code)] <-
# "None" : only 0's may be mixed with negative subscripts
I want
icd9_9code complications
E10.49 present
E11.51 present
E13.52 present
E13.9 none
E10.9 none
E11.21 present
This problem has already been solved. The 'icd' R package, which my co-authors and I have been maintaining for five years, can do this. In particular, it uses standardized sets of comorbidities, including the diabetes with complications you seek, from AHRQ, the original Elixhauser, Charlson, etc.
E.g., for ICD-10 AHRQ, you can see the codes for diabetes with complications here. From icd 4.0, these include ICD-10 codes from the WHO, and all years of ICD-10-CM.
icd::icd10_map_ahrq$DMcx
To use them, first just take your patient data frame and try:
library(icd)
pts <- data.frame(visit_id = c("encounter-1", "encounter-2", "encounter-3",
"encounter-4", "encounter-5", "encounter-6"), icd10 = c("I70401",
"E16", "I70.449", "E13.52", "I70.6", "E11.51"))
comorbid_ahrq(pts)
# and for diabetes with complications only:
comorbid_ahrq(pts)[, "DMcx"]
Or, you can get a data frame instead of a matrix this way:
comorbid_ahrq(pts, return_df = TRUE)
# then you can do:
comorbid_ahrq(pts, return_df = TRUE)$DMcx
If you give an example of the source data and your goal, I can help more.
Seems like there are a few errors in your code; I'll note them in the code below.
You'll want to start by wrapping your ICD codes in quotes ("E13.9") and combining them with c():
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9", "E10.9", "E11.21", "E16.0"))
Next let's use grepl() to search for the particular ICD pattern. Make sure you're applying it to the proper column: your code above uses dm_2$icd9_code rather than dm_2$icd9_9code.
dm_2$DM.complications <- "present"
dm_2$DM.complications[grepl("^E\\d{2}.9$", dm_2$icd9_9code)] <- "None"
Finally,
dm_2
#> icd9_9code DM.complications
#> 1 E10.49 present
#> 2 E11.51 present
#> 3 E13.52 present
#> 4 E13.9 None
#> 5 E10.9 None
#> 6 E11.21 present
#> 7 E16.0 present
A quick side note: there is also a wonderful ICD package you may find handy: https://cran.r-project.org/web/packages/icd/index.html
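One further tweak worth knowing: in a regex, an unescaped . matches any character, so "^E\d{2}.9$" would also match a hypothetical code like "E1309". Escaping the dot and using ifelse() gives the same result in one step; a sketch on the same made-up frame:

```r
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9",
                                  "E10.9", "E11.21", "E16.0"),
                   stringsAsFactors = FALSE)
# \\. matches a literal dot; ifelse() assigns both labels at once
dm_2$DM.complications <- ifelse(grepl("^E\\d{2}\\.9$", dm_2$icd9_9code),
                                "None", "present")
dm_2$DM.complications
#> [1] "present" "present" "present" "None"    "None"    "present" "present"
```

The escape makes no difference on this particular data, but it keeps the pattern from silently over-matching on a larger code list.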

In R - Substring based on repeating character

I have two tables. One table (IPTable) has a column containing IP addresses (which look like this: "10.100.20.13"). I am trying to match each of those to the data in a column of the other table (SubnetTable) which holds subnet addresses (which look like this: "10.100.20", essentially a shortened version of the IP address: everything before the 3rd period). Both variables appear to be character vectors.
Essentially the raw IP data looks like this:
IPTable$IPAddress
10.100.20.13
10.100.20.256
10.100.200.23
101.10.13.43
101.100.200.1
and the raw Subnet data I am comparing it against looks like this:
SubnetTable$Subnet
Varies
10.100.20
Remote Subnet
10.100.200, 101.10.13
Unknown Subnet
Notes:
sometimes the subnet entries contain two subnets within a field separated by a comma
the IPAddress field doesn't have a consistent placement between the groups (e.g. - there could exist "10.110.20.13" as well as "101.10.20.13")
In a different scripting application I am able to simply compare these as strings in a foreach loop. The logic loops through each entry in the Subnet data (SubnetTable), splits it on the comma (to account for the entries with multiple subnet addresses), and then checks whether a match is found in the IP Address field (e.g. is "10.100.20" found anywhere in "10.100.20.13"). I use that field for a join/merge. I understand that in R a foreach loop isn't the most efficient way to do this, and in the other application it takes a long time, which is part of the reason I am moving to R.
I haven't found a method for doing the same thing against this type of data (I have done merges and joins, but I don't see a way of doing that without first getting two variables alike enough to link the two tables).
In the past I have been able to use methods like sqldf, charindex and leftstr to look for a particular character "." and pull everything before it, but the difficulty here is that I would need the 3rd occurrence of the period "." rather than the first. I didn't see a way of doing that, but if there is one, that might be best.
My next attempt was to use strsplit and sapply on the IP address with the idea of reassembling only the first three portions to create a subnet to match against (in a new column/variable). That looked like this:
IPClassC <- sapply(strsplit(Encrypt_Remaining5$IPAddress, "[.]"), `[`)
This gives a "Large List" which makes the data look like this:
chr [1:4] "10" "100" "20" "13"
But when attempting to put it back together I am also losing the period between the octets. Sample code:
paste(c(IPClassC[[1]][1:3]), sep ="[.]", collapse = "")
This produces something like this:
"1010020"
In the end I have two questions:
1) Is there a method for doing the easy comparison I did earlier (essentially merging the Subnet variable of Table1 against "most" of the IPAddress of Table2, based on everything before the third period ".") without having to split out and reassemble the IPAddress field?
2) If not, am I on the right track with trying to split and then reassemble? If so, what am I doing wrong with the reassembly or is there an easier/better way of doing this?
Thanks and let me know what else you need.
I think what you’re essentially asking is how to join these two tables, right? If this is the case, I would do it like this:
library(tidyr)
suppressPackageStartupMessages(library(dplyr))
IPTable <-
data.frame(
IPAddress =
c(
"10.100.20.13",
"10.100.20.256",
"10.100.200.23",
"101.10.13.43",
"101.100.200.1"
),
stringsAsFactors = FALSE
)
I am not sure whether your SubnetTable really looks like this, i.e. mixing subnet addresses with other text. Anyway, this solution simply ignores the other text.
SubnetTable <-
data.frame(
subnet_id = 1:5,
Subnet =
c(
"Varies",
"10.100.20",
"Remote Subnet",
"10.100.200, 101.10.13",
"Unknown Subnet"
),
stringsAsFactors = FALSE
)
First we separate multiple subnets into multiple rows. Note that this assumes that the SubnetTable$Subnet vector only contains a ", " to separate two subnets. I.e. there are no strings like this "Unknown, Subnet", or else these will be separated into two rows as well.
SubnetTable_tidy <- tidyr::separate_rows(SubnetTable, Subnet, sep = ", ")
SubnetTable_tidy
#> subnet_id Subnet
#> 1 1 Varies
#> 2 2 10.100.20
#> 3 3 Remote Subnet
#> 4 4 10.100.200
#> 5 4 101.10.13
#> 6 5 Unknown Subnet
Next we extract the Subnet by replacing/deleting a dot (\\.) followed by one to three numbers (\\d{1,3}) followed by the end of the string ($) from IPTable$IPAddress.
IPTable$Subnet <- gsub("\\.\\d{1,3}$", "", IPTable$IPAddress)
IPTable
#> IPAddress Subnet
#> 1 10.100.20.13 10.100.20
#> 2 10.100.20.256 10.100.20
#> 3 10.100.200.23 10.100.200
#> 4 101.10.13.43 101.10.13
#> 5 101.100.200.1 101.100.200
Now we can join both tables.
IPTable_subnet <-
dplyr::left_join(
x = IPTable,
y = SubnetTable_tidy,
by = "Subnet"
)
IPTable_subnet
#> IPAddress Subnet subnet_id
#> 1 10.100.20.13 10.100.20 2
#> 2 10.100.20.256 10.100.20 2
#> 3 10.100.200.23 10.100.200 4
#> 4 101.10.13.43 101.10.13 4
#> 5 101.100.200.1 101.100.200 NA
unlist(strsplit(SubnetTable$Subnet, split=", ")) %in%
gsub("^(\\d{2,3}\\.\\d{2,3}\\.\\d{2,3}).*$","\\1",IPTable$IPAddress)
This will give you a logical vector with a TRUE/FALSE for each item in Subnet (giving multiple responses for items containing commas). Alternatively, you can flip the two sides to get a logical for each IP address, telling you whether it exists in the subnet list.
Is this what you were looking for?
You can also achieve a similar result with charmatch:
sapply(strsplit(SubnetTable$Subnet, split=", "), charmatch, IPTable$IPAddress)
This gives the following result with your sample data:
[[1]]
[1] NA
[[2]]
[1] 0
[[3]]
[1] NA
[[4]]
[1] 3 4
[[5]]
[1] NA
Note that when there is a single match, you get the index for it, but where there are multiple matches, the value is 0.
Finally, flipping the two sides of this one gives you, for each IP address, the index in Subnet that it matches:
sapply(gsub("^(\\d{2,3}\\.\\d{2,3}\\.\\d{2,3}).*$","\\1",IPTable$IPAddress), charmatch, SubnetTable$Subnet)
results in:
10.100.20 10.100.20 10.100.200 101.10.13 101.100.200
2 2 4 NA NA
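As an aside, the reassembly attempt in the question lost the dots because of how paste() treats its arguments: sep joins multiple arguments, while collapse joins the elements of a single vector, so the dot has to go in collapse, and as a literal "." rather than the regex class "[.]":

```r
octets <- strsplit("10.100.20.13", "[.]")[[1]]  # "10" "100" "20" "13"
subnet <- paste(octets[1:3], collapse = ".")    # collapse, not sep
subnet
#> [1] "10.100.20"
```

With collapse = "" (as in the question), the three octets are glued together with nothing between them, which is why "1010020" came out.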

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
if I wanted to pull the relfreq of Medical Devices (row 5) - how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the 2 columns you requested (industryType and relfreq).
The tidyverse collection of packages allows you to do this intuitively:
library(tidyverse)
data_want <- severity %>%
subset(industryType =="Medical Devices") %>%
select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step to the next, as if nesting.
I think it is better to select the whole row first, then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
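To fold either approach into the function from the question, the industry name can be passed straight into the subset. A sketch, where severity is rebuilt with just two rows for brevity and thing1 is a made-up stand-in for whatever value you are weighting:

```r
severity <- data.frame(
  industryType = c("Medical Devices", "Software"),
  relfreq      = c(1.463415, 0.758808),
  stringsAsFactors = FALSE
)
example <- function(industrytype, thing1 = 1) {
  # logical subsetting finds the row whose industryType matches the argument
  weight <- severity$relfreq[severity$industryType == industrytype]
  thing1 * weight
}
example("Medical Devices")
#> [1] 1.463415
```

Because the subsetting is by name rather than by position, the caller never needs to know which row the industry sits in.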

Plot a histogram of subset of a data

The image shows a screenshot of the .txt file of the data.
The data consists of 2,075,259 rows and 9 columns
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Only data from the dates 2007-02-01 and 2007-02-02 is needed.
I was trying to plot a histogram of "Global_active_power" in the above mentioned dates.
Note that in this dataset missing values are coded as "?"
This is the code i was trying to plot the histogram:
{
data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]
my_data <- gsub(";", " ", my_data) # replace ";" with " "
my_data <- gsub("?", "NA", my_data) # convert "?" to "NA"
my_data <- as.numeric(my_data) # turn into numbers
hist(my_data["Global_active_power"])
}
After running the code it is showing this error:
Error in hist.default(my_data["Global_active_power"]) :
invalid number of 'breaks'
Can you please help me spot the mistake in the code?
Link of the data file : https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
You need to provide the separator (";") explicitly, and your types aren't what you think they are. Observe:
data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
my_data <- data[data$Date >= bottom.date & data$Date <= top.date, 3]
hist(my_data)
That gives the plot. Hope that helps.
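The date-filtering step can be checked without downloading the file. Here is a sketch on synthetic data (four consecutive dates with five made-up readings each) showing that the inclusive comparison keeps exactly the two target days:

```r
# Four dates, five readings per day, values in the 0-8 kW range
data <- data.frame(
  Date = rep(as.Date("2007-01-31") + 0:3, each = 5),
  Global_active_power = runif(20, 0, 8)
)
# Inclusive bounds keep both 2007-02-01 and 2007-02-02
keep <- data$Date >= as.Date("2007-02-01") & data$Date <= as.Date("2007-02-02")
my_data <- data$Global_active_power[keep]
length(my_data)  # 10 readings from the two target days
# hist(my_data)  # then plot as before
```

Strict > and < bounds would return zero rows here, since the bounds themselves are the two days wanted.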
Given you have 2m rows (though not too many columns), you're firmly into fread territory.
Here's how I would do what you want:
library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
na.strings=c("?","NA"),colClasses="character" #2
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% seq(from=as.Date("2007-02-01"), #3
to=as.Date("2007-02-02"),by="day")]
numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]
data[,hist(Global_active_power)] #5
A brief explanation of what's going on
1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)
2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 ? ? ? ? ? ? NA
Once we set na.strings, we sidestep having to replace ? as NA later:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 NA NA NA NA NA NA NA
On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.
3: data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there are only marginal performance differences.
4: Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.
5: Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:
ok_dates<-seq(from=as.Date("2007-02-01"),
to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
select=c("Date","Global_active_power"),
na.strings=c("?","NA"),colClasses="character"
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% ok_dates,hist(as.numeric(Global_active_power))]
