I have df1:
City Freq
Seattle 20
San Jose 10
SEATTLE 5
SAN JOSE 15
Miami 12
I created this dataframe using table(df)
I have another df2:
City
San Jose
Miami
I want to subset df1 so that I keep only the rows whose City values appear in df2. This df2 is only a sample, so I can't chain OR conditions ("|") because I have many different criteria. Perhaps I could convert df2 into a vector, but I'm not sure how to do this; as.vector() doesn't seem to work.
I thought about using
subset(df1, City == df2)
but this gives me errors.
Also, if you guys could get me a way to make this case insensitive such that "San Jose" and "SAN JOSE" are added together, that would be even better!
If I use "toupper / tolower", I get the error: invalid multibyte
Thanks in advance!!
Here are a few more methods.
R Code:
# Method 1: using dplyr package
library(dplyr)
filter(df1, tolower(City) %in% tolower(df2$City))
df1 %>% filter(tolower(City) %in% tolower(df2$City))
# Method 2: using which function
df1[ which( tolower(df1$City) %in% tolower(df2$City)) , ]
# Method 3:
df1[(tolower(df1$City) %in% tolower(df2$City)), ]
Output:
City Freq
2 San Jose 10
4 SAN JOSE 15
5 Miami 12
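To actually add "San Jose" and "SAN JOSE" together (the last part of the question), one option is to aggregate on a lower-cased city name after filtering. A minimal base-R sketch, assuming the df1/df2 shown in the question:

```r
df1 <- data.frame(City = c("Seattle", "San Jose", "SEATTLE", "SAN JOSE", "Miami"),
                  Freq = c(20, 10, 5, 15, 12))
df2 <- data.frame(City = c("San Jose", "Miami"))

# Keep the case-insensitive matches, then sum Freq per lower-cased city
matched <- df1[tolower(df1$City) %in% tolower(df2$City), ]
res <- aggregate(Freq ~ tolower(City), data = matched, FUN = sum)
res
```

Note the city names come out lower-cased; wrap them with tools::toTitleCase() if you need display casing.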
Hope this helps.
I have a csv file that I have read in, and I now need to combine every two rows into one, reducing 2000 rows to 1000. Each pair of rows shares the same account number in one column, with the address split across the two rows in another column. For example, rows 1 and 2 both belong to Acct# 1234, with 123 Hollywood Blvd and LA California 90028 on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with data.table package:
# assuming `dataset` is the name of your dataset, the account-number column is 'actN' and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
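If you would rather stay in base R, aggregate can do the same pairing; a sketch assuming the same actN/adr column names as in the comment above:

```r
# Toy data mirroring the question: two rows per account, address split across them
dataset <- data.frame(actN = c(1234, 1234, 4321, 4321),
                      adr  = c("123 Hollywood Blvd", "LA California 90028",
                               "55 Park Avenue", "NY New York State 6666"))

# paste(..., collapse = " ") glues each account's two address rows into one string
res <- aggregate(adr ~ actN, data = dataset, FUN = paste, collapse = " ")
res
```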
I have tried multiple regular expressions to solve this problem, but none of them works.
I have a data frame like this:
df <- data.frame("Name" = c("Antonio Garcia Fernandez", "Mark Wahlberg", "Juan Antonio Frontera Márquez", "Jose Maria Alvarez Sainz"))
print(df)
And I would like to get the result as a new data frame keeping only the names that have fewer than 3 whitespaces between characters:
Name
Antonio Garcia Fernandez
Mark Wahlberg
Can someone give me a regular expression that filters for the values which contain fewer than 3 whitespaces?
Thanks in advance!
Using filter
library(dplyr)
library(stringr)
df %>%
  filter(str_count(Name, '\\w+') <= 3)  # at most 3 words, i.e. fewer than 3 spaces
Name
1 Antonio Garcia Fernandez
2 Mark Wahlberg
Some base R options:
subset(
df,
nchar(gsub("[^ ]", "", Name)) < 3
)
or
subset(
df,
lengths(regmatches(Name,gregexpr(" ",Name)))< 3
)
subset(df, nchar(gsub(pattern = "\\S", "", df$Name)) < 3)
Name
1 Antonio Garcia Fernandez
2 Mark Wahlberg
As @Quixotic22 mentioned, you can use str_count to count the number of words and keep the rows which have at most 3 words in them.
df[stringr::str_count(df$Name, '\\w+') <= 3 , , drop = FALSE]
# Name
#1 Antonio Garcia Fernandez
#2 Mark Wahlberg
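And since the question literally asks for a regex: a pattern that matches a whole name made of at most three space-separated words (i.e. fewer than 3 spaces). This sketch assumes words are separated by single spaces, as in the example:

```r
df <- data.frame(Name = c("Antonio Garcia Fernandez", "Mark Wahlberg",
                          "Juan Antonio Frontera Márquez", "Jose Maria Alvarez Sainz"))

# ^\S+( \S+){0,2}$ : one word, then at most two further " word" groups
res <- subset(df, grepl("^\\S+( \\S+){0,2}$", Name))
res
```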
I have two data frames:
df1: Names of staff in my organization.
df2: Names of staff in 10 different organizations
I would like to find the people listed in df1 within df2. In particular, I would like to make an additional variable showing whether each name in df2 overlaps with the names in df1 (yes: 1, no: 0).
How should I code this?
Thanks
You can try something like this:
Use data.table to check for matches between df1 and df2 on the staff_names column.
library(data.table)
Manually create data tables
df1 <- data.table(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"))
df2 <- data.table(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"))
Code:
df3 <- df1[df2, on=c(staff_names="staff_names"), overlap:="1"]
df3[is.na(df3)] <- 0
#> staff_names overlap
#> 1: John Appleseed 1
#> 2: Daniel Lewis 0
#> 3: Todd Smith 0
Created on 2020-08-08 by the reprex package (v0.3.0)
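If all you need is the 0/1 flag on df2 (as the question asks), a plain base-R %in% also works, with no join at all; a sketch using the same toy data:

```r
df1 <- data.frame(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"))
df2 <- data.frame(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"))

# 1 if the df2 name also appears in df1, 0 otherwise
df2$overlap <- as.integer(df2$staff_names %in% df1$staff_names)
df2
```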
This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
  for (j in 1:nrow(a)) {
    if (grepl(a$geocode_selector[j], b$geocode[i])) {
      b$county_name[i] <- a$county_name[j]
    }
  }
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
         on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- data_frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data_frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))
b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond
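For completeness: because the "regex" here is really a fixed five-character prefix, base R match on substr gives the same lookup without any join; a sketch with the question's data:

```r
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))

# Look up each block's county via its 5-digit prefix; unmatched rows get NA
b$county_name <- a$county_name[match(substr(b$geocode, 1, 5), a$geocode_selector)]
b
```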
Apologies if the title of the question is not so clear.
I have two data frames as below:
df1
NAME FOLLOWS
san big supa
san EAU
san simulate
san spang
glyn guido
glyn claire
glyn vincent
glyn dan
glyn peter
glyn EAU
df2
FOLLOWS
guido
vincent
EAU
EUSC
brian
simulate
peter
I would like to count matches between df1$FOLLOWS and df2$FOLLOWS for each NAME in df1, and also the length of df1$FOLLOWS for each NAME in df1. For these data frames, I am expecting output like this:
df3
NAME LENGTH_FOLLOWS COUNT_Match
san 4 2
glyn 6 4
You can merge df1 with df2 first, which will keep only the values present in df1. Then you can simply count the instances.
library(sqldf)
sqldf('select NAME, count(NAME) as LENGTH_FOLLOWS , count(Actual_F) as COUNT_Match from (select t1.*, t2.FOLLOWS as Actual_F from df1 t1 left join df2 t2 on t1.FOLLOWS=t2.FOLLOWS) group by NAME')
Or using base R
df1$index <- match(df1$FOLLOWS, df2$FOLLOWS)
aggregate(cbind(df1$FOLLOWS, df1$index), by = list(df1$NAME), FUN = function(x) length(x[!is.na(x)]))
Here is an option using data.table. Convert the first data.frame to 'data.table' (setDT(df1)) and join on with the 'df2' to create an index column ('ind'). Then, grouped by 'NAME', we get the number of rows (.N) and the sum of logical vector of non-NA elements in 'ind'
library(data.table)
setDT(df1)[df2, ind := 1, on = .(FOLLOWS)]
df1[, .(LENGTH_FOLLOWS = .N, COUNT_MATCH = sum(!is.na(ind))), NAME]
# NAME LENGTH_FOLLOWS COUNT_MATCH
#1: san 4 2
#2: glyn 6 4
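A dplyr version of the same idea, for anyone already in the tidyverse (a sketch; assumes the df1/df2 shown in the question):

```r
library(dplyr)

df1 <- data.frame(NAME = c(rep("san", 4), rep("glyn", 6)),
                  FOLLOWS = c("big supa", "EAU", "simulate", "spang",
                              "guido", "claire", "vincent", "dan", "peter", "EAU"))
df2 <- data.frame(FOLLOWS = c("guido", "vincent", "EAU", "EUSC",
                              "brian", "simulate", "peter"))

# Per NAME: count the followed accounts, and how many of them appear in df2
out <- df1 %>%
  group_by(NAME) %>%
  summarise(LENGTH_FOLLOWS = n(),
            COUNT_Match = sum(FOLLOWS %in% df2$FOLLOWS))
out
```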