I am having some trouble cleaning up my data. It is a list of sold houses, each with the sale price, number of rooms, area in m2 and the address.
As seen below, the address is stored as a single string.
head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road number, immediately followed by a 4-digit postal code, and then the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.
1) Using separate from tidyr, split Address into 3 fields, merging anything left over into the last one. Then use separate again to split off the last 4 characters of the Number column generated by the first call.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
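To make the effect of sep = -4 concrete, here is a minimal base R sketch of the same "last four characters" split. It assumes the Number values produced by the first separate call look like the ones typed in by hand below; the variable names are illustrative only.

```r
# Values as they would appear in the intermediate Number column (assumed)
number <- c("772900", "121B2900", "125800")

n <- nchar(number)

# Everything except the last 4 characters is the street number
street_no <- substr(number, 1, n - 4)

# The last 4 characters are the postal code
postal <- substr(number, n - 3, n)

print(data.frame(StreetNo = street_no, Postal = postal))
```

This only works because the postal code always has exactly 4 digits; a variable-length code would need a regular expression instead.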
2) Alternatively, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Update
Added and fixed (2).
Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
# This splits the Address column at each space into up to 4 columns (Address_1 to Address_4)
# You can then use ifelse to combine the last 2 columns into the city name, like so:
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))
One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
In R, the stringr package is convenient for regex work because functions such as str_match() return every capture group at once, which in this example would let you separate each component of the address with a single expression.
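As a rough illustration of the one-expression idea, the sketch below uses base R's regexec() and regmatches() as a stand-in for stringr::str_match(). The pattern and the column names are assumptions modelled on the three sample addresses, not a general address parser.

```r
# Sample addresses from the question; \u00e6 is the letter ae-ligature in "Kraensvej"
addr <- c("Petersvej 772900 Hoersholm",
          "Annasvej 121B2900 Hoersholm",
          "Kr\u00e6nsvej 125800 Lyngby C")

# Groups: road name, street number, 4-digit postal code, city
pattern <- "^(.*\\S) +(\\S+)(\\d{4}) +(.*)$"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(addr, regexec(pattern, addr))

# Drop the full match (first element) and bind the groups row by row
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("Road", "Number", "Postal", "City")
parts <- as.data.frame(parts, stringsAsFactors = FALSE)

print(parts)
```

An address that does not fit the pattern yields an empty match, so real data would need a check for that before binding the rows.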
I am trying to extract information from more than 2 columns (2 columns are given as an example below) using a list of strings. I want to create another column containing the string from the list that is found in either column, with a fixed order for which column to look in first. The example below shows the data and the desired output.
A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)
list <- c("NYU","FIT","UCLA","CA","UT","USC")
A B
1 This contains NYU NYU
2 This has NYU UT
3 This has XT USC
4 This has FIT FIT
5 Something something UNH UNA
6 I got into UCLA UCLA
7 Hello XT CA
I would like the code to search for the strings in the list, looking in column A first; if none is found there, look in column B; and if none is found there either, give null (NA). Using the list above, the desired output is:
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA <NA>
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
You can turn your list into a regular expression and then apply R's regexpr function:
expr <- paste0(list,collapse = "|")
# expr = "NYU|FIT|UCLA|CA|UT|USC" -> Reg expr means NYU or FIT or ......
data[,"C"] <- ""
cols <- rev(names(data)[-(which(names(data)=="C"))])
for(c in cols) {
index <- regexpr(expr,data[,c])
data[,"C"] <- ifelse(index != -1,substr(data[,c],index,index + attr(index,"match.length")-1),data[,"C"])
}
Hope that helps.
Gottavianoni
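A variant of the same regexpr idea is sketched below, under two assumptions that go beyond the answer above: matches should be whole words only (so "UT" would not match inside a longer word), and rows with no match should be NA rather than an empty string. The names keys and first_match are hypothetical helpers introduced for this sketch.

```r
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
data <- data.frame(A, B, stringsAsFactors = FALSE)

keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# \\b restricts the alternation to whole-word matches (assumption)
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# Return the first match in each string, or NA when there is none
first_match <- function(x) {
  x <- as.character(x)
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

from_A <- first_match(data$A)
from_B <- first_match(data$B)

# Column A takes priority; fall back to column B only when A has no match
data$C <- ifelse(is.na(from_A), from_B, from_A)
print(data)
```

With more than two columns, the same fallback can be repeated in priority order.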
Another approach could be
#common between column A & vector l
C_tempA <- sapply(df$A, function(x) intersect(strsplit(as.character(x), split = " ")[[1]], l))
#common between column B & vector l
C_tempB <- sapply(df$B, function(x) intersect(as.character(x), l))
#column C calculation
df$C <- ifelse(C_tempA=="character(0)", C_tempB, C_tempA)
df$C[df$C=="character(0)"] <- NA
#final dataframe
df
Output is:
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA NA
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
Sample data:
df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT",
"I got into UCLA", "Something something UNH", "This contains NYU",
"This has FIT", "This has NYU", "This has XT"), class = "factor"),
B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA",
"FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A",
"B"), row.names = c(NA, -7L), class = "data.frame")
l <- c("NYU","FIT","UCLA","CA","UT","USC")
Use tokenize_words from the tokenizers package (load it with library(tokenizers)).
First merge the two columns into a new column that contains both A and B:
data$newC <- paste(data$A, data$B, sep = " " )
Then run the loop below, which collects the extracted values in a vector that you can then cbind to the existing data frame.
newcolumn <- 'X'
for (p in data$newC)
{
if (!is.na(p))
{
x <- which(is.element(unlist(tokenize_words(list, lowercase = TRUE)), unlist(tokenize_words(p, lowercase = TRUE, stopwords = NULL, simplify = FALSE))))
newcolumn <- append(newcolumn,ifelse(x[1]!= 0, list[x[1]], "NA"))
}
}
newcolumn <- newcolumn[-1]
newcolumn
data <- cbind(data, newcolumn)
Hope it helps.
With the code above I get the output you expected.
I have two columns, both of character type.
One column holds plain strings and the other holds strings wrapped in quotes.
I want to compare both columns and find the number of distinct names across the data frame.
string f.string.name
john NA
bravo NA
NA "john"
NA "hulk"
Here the count should be 2, as john is common.
Somehow I am not able to remove the quotes from the second column, and I am not sure why.
Thanks
The main problem I'm seeing is the NA values.
First, let's get rid of the quotes you mention.
dat$f.string.name <- gsub('["]', '', dat$f.string.name)
Now, count the number of distinct values.
i1 <- complete.cases(dat$string)
i2 <- complete.cases(dat$f.string.name)
sum(dat$string[i1] %in% dat$f.string.name[i2]) + sum(dat$f.string.name[i2] %in% dat$string[i1])
DATA
dat <-
structure(list(string = c("john", "bravo", NA, NA), f.string.name = c(NA,
NA, "\"john\"", "\"hulk\"")), .Names = c("string", "f.string.name"
), class = "data.frame", row.names = c(NA, -4L))
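The question can be read in two ways (names that occur in both columns, or distinct names overall), so the sketch below computes both in base R after stripping the quotes. It rebuilds the same dat by hand; all_names and common are names introduced here for illustration.

```r
dat <- data.frame(string = c("john", "bravo", NA, NA),
                  f.string.name = c(NA, NA, "\"john\"", "\"hulk\""),
                  stringsAsFactors = FALSE)

# Remove the literal double quotes; NA values stay NA
clean <- gsub('"', '', dat$f.string.name, fixed = TRUE)

# Keep only the non-missing entries of each column
s1 <- dat$string[!is.na(dat$string)]
s2 <- clean[!is.na(clean)]

# Distinct names across both columns
all_names <- unique(c(s1, s2))

# Names that appear in both columns
common <- intersect(s1, s2)

print(all_names)
print(common)
```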
library(stringr)
table(str_replace_all(unlist(dat), '["]', ''))
# bravo hulk john
# 1 1 2
R stores factors as integer codes plus a levels attribute. As a result, identical() returns FALSE for two factors that hold the same values if their sets of levels differ.
Here's an MWE:
y <- structure(list(portfolio_date = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("2000-10-31", "2001-04-30"), class = "factor"),
security = structure(c(2L, 2L, 1L, 3L, 2L, 4L), .Label = c("Currency Australia (Fwd)",
"Currency Euro (Fwd)", "Currency Japan (Fwd)", "Currency United Kingdom (Fwd)"
), class = "factor")), .Names = c("portfolio_date", "security"
), row.names = c(10414L, 10417L, 10424L, 21770L, 21771L, 21774L
), class = "data.frame")
x <- structure(list(portfolio_date = structure(1L, .Label = "2000-10-31", class = "factor"),
security = structure(1L, .Label = "Currency Euro (Fwd)", class = "factor")),
.Names = c("portfolio_date", "security"), row.names = 10414L, class = "data.frame")
identical(y[1,], x)
Returns FALSE
But if we look at the objects, they appear identical to the user
y[1,]
portfolio_date security
10414 2000-10-31 Currency Euro (Fwd)
x
portfolio_date security
10414 2000-10-31 Currency Euro (Fwd)
Ultimately I want to be able to do something like the following:
apply(y, 1, identical, x)
10414 10417 10424 21770 21771 21774
TRUE TRUE FALSE FALSE FALSE FALSE
which(apply(y, 1, identical, x))
1 2
Any suggestions as to how to achieve this? Thanks.
One option is to use rowwise from dplyr to check row by row. If you need to compare the row names at the same time, create an id column for both data frames; otherwise it will return TRUE for the first two rows.
library(dplyr)
x$id <- row.names(x)
y$id <- row.names(y)
rowwise(y) %>% do(check = isTRUE(all.equal(., x, check.attributes = F))) %>% data.frame
check
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
In order to perform the comparison, the factors need to be converted into character objects.
By using base R alone here is a solution:
apply(apply(y, 2, as.character), 1, identical, apply(x, 2, as.character))
The inner apply loops convert each column in the source and target data frames to character objects and the outer apply loops through the rows.
If the x data frame has more than one row, the actual behavior may not be as expected.
Use the package 'compare'.
library(compare)
result <- NULL
for (i in 1:NROW(y)){
one <- compare(y[i,], x, dropLevels=T)
two <- one$detailedResult[1]==T & one$detailedResult[2]==T
result <- c(result, two)
}
as.character(result)#TRUE TRUE FALSE FALSE FALSE FALSE
Solution for data posted in OP
The example posted in the OP can be easily treated by using droplevels().
Let us first look at why the comparison identical(y[1,], x) returns FALSE:
str(y[1,])
#'data.frame': 1 obs. of 2 variables:
#$ portfolio_date: Factor w/ 2 levels "2000-10-31","2001-04-30": 1
#$ security : Factor w/ 4 levels "Currency Australia (Fwd)",..: 2
whereas
str(x)
#'data.frame': 1 obs. of 2 variables:
#$ portfolio_date: Factor w/ 1 level "2000-10-31": 1
#$ security : Factor w/ 1 level "Currency Euro (Fwd)": 1
So the difference lies in the factor levels, even though both objects are displayed in the same way, as shown in the OP's question.
This is where the function droplevels() is useful: it removes unused factor levels. By applying droplevels() to y[1,] with its redundant levels, we obtain:
identical(droplevels(y[1,]), x)
#[1] TRUE
If x also contains unused factor levels, it will be necessary to wrap it in droplevels(), too. In any case, it won't do any harm:
identical(droplevels(y[1,]), droplevels(x))
#[1] TRUE
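The same effect can be reproduced on a toy factor, which may make the mechanism easier to see. The values below are made up purely for illustration and are not taken from the OP's data.

```r
# A factor with three levels, and a one-element subset of it
f_full <- factor(c("a", "b", "c"))
f_sub  <- f_full[1]      # still carries all three levels

# A freshly built factor with the same single value
f_one <- factor("a")     # has only one level

# Same printed value, different levels attribute
res_raw <- identical(f_sub, f_one)

# Dropping the unused levels makes the two objects identical
res_dropped <- identical(droplevels(f_sub), f_one)

# Comparing as character ignores the levels entirely
res_char <- identical(as.character(f_sub), as.character(f_one))

print(c(raw = res_raw, dropped = res_dropped, char = res_char))
```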
General solution
Using droplevels() may not work if the real data is more complex than the data posted in the "MWE" of the OP. Such situations may include, e.g., equivalent entries in x and y[1,] that are stored as different factor levels. An example where droplevels() fails is given in the data section at the end of this answer.
The following solution represents an efficient possibility to treat such general situations. It works for the data posted in the OP as well as for the more complicated case of the data posted below.
First, two auxiliary vectors are created that contain only the characters of each row. By using paste() we can concatenate each row to a single character string:
temp_x <- apply(x, 1, paste, collapse=",")
temp_y <- apply(y, 1, paste, collapse=",")
With these vectors it becomes easy to compare rows of the original data.frames, even if the entries were originally stored as factors with different levels and numbering.
To identify which rows are identical, we can use the %in% operator, which is more appropriate than the function identical() in this case, as the former checks for equality of all possible row combinations, and not just individual pairs.
With these simple modifications the desired output can be obtained quickly and without further loops:
setNames(temp_y %in% temp_x, names(temp_y))
#10414 10417 10424 21770 21771 21774
# TRUE TRUE FALSE FALSE FALSE FALSE
which(temp_y %in% temp_x)
#[1] 1 2
y[temp_y %in% temp_x,]
# portfolio_date security
#10414 2000-10-31 Currency Euro (Fwd)
#10417 2000-10-31 Currency Euro (Fwd)
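One possible refinement, offered only as a sketch: pasting with "," could in principle make two different rows look alike if a value itself contains a comma. The hypothetical helper row_keys below builds the keys column-wise with a separator that is unlikely to occur in the data; the data frames are rebuilt by hand from the values in the OP's question.

```r
# Build one key string per row; sep is assumed not to occur in the data
row_keys <- function(df, sep = "\r") {
  do.call(paste, c(lapply(df, as.character), sep = sep))
}

y <- data.frame(
  portfolio_date = factor(c("2000-10-31", "2000-10-31", "2000-10-31",
                            "2001-04-30", "2001-04-30", "2001-04-30")),
  security = factor(c("Currency Euro (Fwd)", "Currency Euro (Fwd)",
                      "Currency Australia (Fwd)", "Currency Japan (Fwd)",
                      "Currency Euro (Fwd)", "Currency United Kingdom (Fwd)"))
)
x <- data.frame(portfolio_date = factor("2000-10-31"),
                security = factor("Currency Euro (Fwd)"))

# TRUE for every row of y that also occurs as a row of x
matches <- row_keys(y) %in% row_keys(x)

print(which(matches))
```

Because the keys are built from character values, the factor levels and their numbering play no role in the comparison.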
data
x <- structure(list(portfolio_date = structure(1:2, .Label = c("2000-05-15",
"2000-10-31"), class = "factor"), security = structure(c(2L, 1L),
.Label = c("Currency Euro (Fwd)", "Currency USD (Fwd)"),
class = "factor")), .Names = c("portfolio_date", "security"),
class = "data.frame", row.names = c("10234", "10414"))
y <- structure(list(portfolio_date = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("2000-10-31", "2001-04-30"), class = "factor"),
security = structure(c(2L, 2L, 1L, 3L, 2L, 4L),
.Label = c("Currency Australia (Fwd)", "Currency Euro (Fwd)",
"Currency Japan (Fwd)", "Currency United Kingdom (Fwd)"),
class = "factor")), .Names = c("portfolio_date", "security"),
row.names = c(10414L, 10417L, 10424L, 21770L, 21771L, 21774L),
class = "data.frame")
As part of a project, I am using R to analyze some data. I am stuck retrieving a few values from a dataset that I imported from a csv file.
For my analysis, I want to create another column that holds the current value of x minus its previous value. For the first row of every unique i, the new column should simply equal the current value of x. I am new to R and have been trying various approaches for some time without success. Any suggestions on an approach I can follow would be appreciated.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
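If adding a package is not an option, roughly the same grouped difference can be sketched in base R with ave(). The small data frame below is made up for illustration (it is not the OP's data), and x_diff is just the column name borrowed from the data.table answer.

```r
# Toy data with two groups in column i (illustrative values)
MyData <- data.frame(t = 1:6,
                     x = c(10L, 13L, 19L, 40L, 42L, 47L),
                     i = c(1L, 1L, 1L, 2L, 2L, 2L))

# Within each group of i: keep the first x, then successive differences
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))

print(MyData)
```

ave() applies the function per group and returns the results in the original row order, so no sorting or re-binding is needed.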
You can use the diff() function. Note that diff() returns a vector that is one element shorter than its input, so to add the result as a new column to your existing data frame you need to pad it. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That should insert an NA value as the first entry in your new column, and the remaining values will be the differences between sequential values in your "x" column.
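The length mismatch is easy to see on a short vector. The numbers below are arbitrary and only meant to show why the padding is needed.

```r
x <- c(3, 8, 10)

# diff() returns one value fewer than its input
d <- diff(x)

# Padding with NA restores the original length so it fits as a column
padded <- c(NA, d)

print(d)
print(padded)
```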
UPDATE:
You can write a simple loop that subsets the data for every unique value of "i" and then calculates the differences between the x values within each subset:
# initialize a new data frame
newdf = NULL
values = unique(MyData$i)
for(i in 1:length(values)){
# rows belonging to the current group (== compares, = would assign)
data1 = MyData[MyData$i == values[i],]
data1$newX = c(NA,diff(data1$x))
# append the processed group to the result
newdf = rbind(newdf,data1)
}
# and then, if you want to overwrite your original data frame with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)