Search a column of names in another data frame and combine the result with data from another column - R

I want to create a data frame based on two distinct data frames.
The first one has the names of journals and their respective impact factors.
The second data frame has the names of the journals that I want to search for.
df1:
Full Journal Title Journal Impact Factor
CA-A CANCER JOURNAL FOR CLINICIANS 223.679
Nature Reviews Materials 74.449
NEW ENGLAND JOURNAL OF MEDICINE 70.670
LANCET 59.102
NATURE REVIEWS DRUG DISCOVERY 57.618
CHEMICAL REVIEWS 54.301
Nature Energy 54.000
NATURE REVIEWS CANCER 51.848
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION 51.273
NATURE REVIEWS IMMUNOLOGY 44.019
NATURE REVIEWS GENETICS 43.704
NATURE REVIEWS MOLECULAR CELL BIOLOGY 43.351
NATURE 43.070
and continues...
str(df1)
'data.frame': 12541 obs. of 2 variables:
$ my.journal: Factor w/ 11879 levels "","2D Materials",..: 4155 1872 8866 8999 8033 8861 2143 8841 8856 5795 ...
$ jcr : Factor w/ 4732 levels "","0.000","0.006",..: 4731 2905 4614 4613 4337 4336 4335 4334 4333 4332 ...
df2:
my.journal
1 Bioscience journal
2 Summa phytopathologica (impresso)
3 Summa phytopathologica (impresso)
4 Summa phytopathologica (impresso)
5 Australian journal of crop science (online)
6 Summa phytopathologica (impresso)
7 Summa phytopathologica
8 Pesquisa agropecuaria tropical (online)
9 Crop breeding and applied biotechnology
10 Genetics and molecular research
11 Tropical plant pathology
12 Genetics and molecular research
13 Perspectivas online: biológicas e saúde
14 Científica (jaboticabal. online)
15 Journal of plant physiology & pathology
16 Tropical plant pathology
17 Summa phytopathologica (impresso)
> str(df2)
'data.frame': 17 obs. of 1 variable:
$ my.journal: Factor w/ 11 levels "Australian journal of crop science (online)",..: 2 10 10 10 1 10 9 8 4 5 ...
I want another data frame (df3) where the journals in df2 are looked up in df1 and, when they match, I get something like this (but without the NA):
In place of each NA I want the Journal Impact Factor corresponding to the journal in my df2.
df3
journal jcr total
<chr> <fct> <int>
1 Summa phytopathologica (impresso) NA 5
2 Genetics and molecular research NA 2
3 Tropical plant pathology NA 2
4 Australian journal of crop science (online) NA 1
5 Bioscience journal NA 1
6 Científica (jaboticabal. online) NA 1
7 Crop breeding and applied biotechnology NA 1
8 Journal of plant physiology & pathology NA 1
9 Perspectivas online: biológicas e saúde NA 1
10 Pesquisa agropecuaria tropical (online) NA 1
11 Summa phytopathologica NA 1
I started using R a few months ago and I don't know how to begin solving this.
The two data frames are available via the links df1 and df2.

Update:
One solution would be to use a join with dplyr:
library(dplyr)
df1 <- read.table("df1.txt", skip = 1, header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table("df2.txt", header = TRUE, stringsAsFactors = FALSE)
df1 <- df1 %>%
  mutate(Full.Journal.Title = toupper(Full.Journal.Title))
df2 <- df2 %>%
  mutate(my.journal = toupper(my.journal))
df2 %>%
  left_join(df1, by = c("my.journal" = "Full.Journal.Title")) %>%
  group_by(my.journal, Journal.Impact.Factor) %>%
  summarize(total = n()) %>%
  arrange(desc(total))
my.journal Journal.Impact.Factor total
<chr> <chr> <int>
1 SUMMA PHYTOPATHOLOGICA (IMPRESSO) NA 5
2 GENETICS AND MOLECULAR RESEARCH NA 2
3 TROPICAL PLANT PATHOLOGY 1.254 2
4 AUSTRALIAN JOURNAL OF CROP SCIENCE (ONLINE) NA 1
5 BIOSCIENCE JOURNAL 0.375 1
6 CIENTíFICA (JABOTICABAL. ONLINE) NA 1
7 CROP BREEDING AND APPLIED BIOTECHNOLOGY 1.026 1
8 JOURNAL OF PLANT PHYSIOLOGY & PATHOLOGY NA 1
9 PERSPECTIVAS ONLINE: BIOLóGICAS E SAúDE NA 1
10 PESQUISA AGROPECUARIA TROPICAL (ONLINE) NA 1
11 SUMMA PHYTOPATHOLOGICA NA 1
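As a small variation (assuming a reasonably recent dplyr, since count() gained its name argument around version 0.8), the group_by/summarize/arrange steps can be collapsed into a single count() call:
# equivalent to the group_by() + summarize() + arrange() above;
# name = "total" keeps the same column name as before
df2 %>%
  left_join(df1, by = c("my.journal" = "Full.Journal.Title")) %>%
  count(my.journal, Journal.Impact.Factor, sort = TRUE, name = "total")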
A few things to note to make this work:
When reading in df1, the header appears to take up two rows, so I skipped the first line (this more closely matched your earlier example).
read.table gets stringsAsFactors = FALSE if you do not want the text columns read as factors.
Some journal names are upper case and others lower case. The join is case-sensitive, so I included toupper to make everything upper case before the join (as an alternative, you can apply toupper on the fly inside the join pipeline if you want to leave the original data frames untouched; see the sketch below).
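One reading of that alternative, as a sketch (the on-the-fly mutate calls are illustrative, not the only way to do it):
library(dplyr)
# case-fold both keys inside the pipeline, leaving the stored df1/df2 unchanged
df2 %>%
  mutate(my.journal = toupper(my.journal)) %>%
  left_join(mutate(df1, Full.Journal.Title = toupper(Full.Journal.Title)),
            by = c("my.journal" = "Full.Journal.Title"))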
Please let me know if this is what you had in mind.

Related

approximate character matching using R

I have two data files. One contains only a single column with the name of the company (usually a hospital), and the other contains a list of companies with their respective addresses. The problem is that the company names do not match exactly. How can I match them approximately?
> dput(head(HOSPITALS[130:140,], 10))
I would like to obtain one data file where each company is matched with an address, if one is available in the address data.
Check out the fuzzyjoin package and the stringdist_join functions.
Here's a starting point. In your example data, ignore_case = TRUE solves the matching problem. Depending on how the full data looks, you will have to experiment with the arguments (e.g. max_dist) and possibly filter the result until you achieve what you want (one such filter is sketched after the output below).
library(dplyr)
library(fuzzyjoin)
HOSPITALS %>%
  stringdist_left_join(GH_MY,
                       by = c("hospital" = "hospital_name"),
                       ignore_case = TRUE,
                       max_dist = 2,
                       distance_col = "dist")
Result:
# A tibble: 10 x 6
hospital hospital_name adress district town dist
<chr> <chr> <chr> <chr> <chr> <dbl>
1 HOSPITAL PAPAR Hospital Papar Peti Surat No. 6, Papar Sabah 0
2 HOSPITAL PARIT BUNT~ Hospital Parit ~ Jalan Sempadan Parit Bun~ Perak 0
3 HOSPITAL PEKAN Hospital Pekan 26600 Pekan Pekan Pahang 0
4 HOSPITAL PENAWAR SD~ NA NA NA NA NA
5 HOSPITAL PORT DICKS~ Hospital Port D~ KM 11, Jalan Pantai Port Dick~ Negeri ~ 0
6 HOSPITAL PULAU PINA~ Hospital Pulau ~ Jalan Residensi Pulau Pin~ Pulau P~ 0
7 HOSPITAL PUSRAWI SD~ NA NA NA NA NA
8 HOSPITAL PUSRAWI SM~ NA NA NA NA NA
9 HOSPITAL PUTRAJAYA Hospital Putraj~ Pusat Pentadbiran Ker~ Putrajaya WP Putr~ 0
10 HOSPITAL QUEEN ELIZ~ NA NA NA NA NA
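As one example of the post-filtering mentioned above (a sketch; slice_min() needs dplyr >= 1.0, and the column names follow the output shown here):
matched <- HOSPITALS %>%
  stringdist_left_join(GH_MY,
                       by = c("hospital" = "hospital_name"),
                       ignore_case = TRUE,
                       max_dist = 2,
                       distance_col = "dist")
# keep only the closest candidate per hospital; rows with no match at all
# (dist is NA) are retained since there is nothing better to pick
matched %>%
  group_by(hospital) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()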

Importing .csv file with tidydata

I am having difficulty importing my data from a .csv file into tidy (long) format the way I would like.
My data set is made up of descriptive data (age, country, etc.) followed by 15 condition columns that I would like to collapse into a single column (long format). I have previously tried 'melting' the data in a few ways, but it does not turn out the way I intended. Below are a few things I have tried; I know it is kind of messy. There are quite a few NAs in the data, which seem to be causing an issue. I am trying to create a specific column "Vignette" that will serve as the collective column for the 15 vignette columns I would like in long format.
head(dat)
ID Frequency Gender Country Continent Age
1 5129615189 At least weekly female France Europe 30-50 years
2 5128877943 At least daily female Spain Europe > 50 years
3 5126775994 At least weekly female Spain Europe 30-50 years
4 5126598863 At least daily male Albania Europe 30-50 years
5 5124909744 At least daily female Ireland Europe > 50 years
6 5122047758 At least weekly female Denmark Europe 30-50 years
Practice Specialty Seniority AMS
1 University public hospital centre Infectious diseases 6-10 years Yes
2 Other public hospital Infectious diseases > 10 years Yes
3 University public hospital centre Intensive care > 10 years Yes
4 University public hospital centre Infectious diseases > 10 years No
5 Private hospial/clinic Clinical microbiology > 10 years Yes
6 University public hospital centre Infectious diseases 0-5 years Yes
Durations V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
1 range 7 2 7 7 7 5 7 14 7 42 42 90 7 NA 5
2 range 7 10 10 5 14 5 7 14 10 42 21 42 14 14 14
3 range 7 5 5 7 14 5 5 13 10 42 42 42 5 0 7
4 range 10 7 7 5 7 10 7 5 7 28 14 42 10 10 7
5 range 7 5 7 7 14 7 7 14 10 42 42 90 10 0 7
6 fixed duration 7 3 3 7 10 10 7 14 7 90 90 90 10 7 7
dat_long %>%
gather(Days, Age, -Vignette)
dat$new_sp = NULL
names(dat) <- gsub("new_sp", "", names(dat))
dat_tidy<-melt(
data=dat,
id=0:180,
variable.name="Vignette",
value.name="Days",
na.rm=TRUE
)
dat_tidy<- mutate(dat_tidy,
Days= sub("^V", "", Days)
)
It keeps saying "Error: id variables not found in data: NA"
I have tried to get rid of NAs but it doesn't seem to do anything.
I am guessing you are loading the melt function from reshape2. I recommend that you try tidyr, which is basically the next generation of reshape2.
Your error presumably comes from the argument id=0:180. This is basically asking it to keep columns 0-180 as "identifier" columns and melt the rest (i.e. create a new row for each value in every other column).
When you ask for more column indices than the data.frame actually has, the non-existent columns come back as plain NA - you asked for them, so you get them!
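A quick way to see where that NA comes from (a toy illustration with a built-in data set, not your data):
# mtcars has 11 columns; asking names() for 15 entries pads the tail with NA,
# which melt() then reports as "id variables not found in data: NA"
names(mtcars)[1:15]
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb" NA     NA     NA     NA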
I would recommend loading tidyr, as it is newer. There should be some newer verbs in the package that are more intuitive (sketched after the code below), but I'll give you a solution with the older semantics:
library(tidyr)
dat_tidy <- dat %>% gather('Vignette', 'Days', starts_with('V'))
# or a bit more verbose
dat_tidy <- dat %>% gather('Vignette', 'Days', V01, V02, V03, V04)
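The newer, more intuitive verbs alluded to above are presumably pivot_longer() and friends (tidyr >= 1.0.0); a rough equivalent of the gather() call would be:
library(tidyr)
# same reshape with the newer verb; values_drop_na = TRUE mirrors the
# na.rm = TRUE from the original melt() attempt
dat_tidy <- dat %>%
  pivot_longer(cols = starts_with("V"),
               names_to = "Vignette",
               values_to = "Days",
               values_drop_na = TRUE)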
And check out the comment by @heck1 for tips on asking even better questions.

cbind arguments in large dataframe

I have searched unsuccessfully for several days for an answer to this question: I have a dataframe with 279 columns and want to generate subtotals using aggregate(), or indeed, anything suitable. Here is a subset:
LGA off.cat sub.cat Jan1995 Feb1995
1 Albury Homicide Murder * 0 0
2 Albury Homicide Attempted murder 0 0
3 Albury Homicide Murder accessory, conspiracy 0 0
4 Albury Homicide Manslaughter * 0 0
5 Albury Assault Domestic violence related assault 7 7
6 Albury Assault Non-domestic violence related assault 29 20
7 Albury Assault Assault Police 12 3
8 Albury Sexual offences Sexual assault 4 3
The full dataframe contains dozens of LGA values, and many more date columns. I would like to obtain subtotals for each unique LGA value grouped by unique values of off.cat and sub.cat, summed over all dates. I tried using cbind in aggregate, but found no way to generate the 276 date column names that would not cause errors. Explicit column names worked fine. Apologies for the lack of clarity in the earlier post, and thanks to those who valiantly tried to interpret my meaning.
Your question is a bit unclear, but you may be successful using the formula syntax of aggregate. Here's an example:
df <- data.frame(group = letters[1:5],
                 x = 1:5,
                 y = 6:10,
                 z = 11:15)
group x y z
1 a 1 6 11
2 b 2 7 12
3 c 3 8 13
4 d 4 9 14
5 e 5 10 15
We now sum x, y and z within each level of group (note that x + y + z on the left-hand side of the formula produces a single combined total per group), using setdiff to get a vector of all column names except group and pasting them together for use in as.formula:
aggregate(as.formula(paste(paste(setdiff(names(df), c("group")), collapse = "+"), "~ group")),
          data = df, sum)
group x + y + z
1 a 18
2 b 21
3 c 24
4 d 27
5 e 30
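Since the question specifically mentions cbind, one variant worth noting (a sketch on the same toy data) builds a cbind(...) ~ group formula instead, which keeps a separate sum for each column rather than one combined total:
# build cbind(x, y, z) ~ group programmatically and sum each column per group
f <- as.formula(paste0("cbind(",
                       paste(setdiff(names(df), "group"), collapse = ", "),
                       ") ~ group"))
aggregate(f, data = df, sum)
#>   group x  y  z
#> 1     a 1  6 11
#> 2     b 2  7 12
#> 3     c 3  8 13
#> 4     d 4  9 14
#> 5     e 5 10 15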
Hope this helps.

How can I overcome this error: Error in tbl_vars(y) : argument "y" is missing, with no default?

I am trying to perform an inner join on two tables.
One is a hotel dataset, which I tokenized beforehand using:
df1 = read.csv("chennai.csv", header = TRUE, stringsAsFactors=FALSE)
library(dplyr)
library(tidytext)
hotel <- df1 %>% unnest_tokens(word,Review_Text)
data("stop_words")
hotel <- hotel %>%
  anti_join(stop_words)
head(hotel)
Hotel_name Review_Title Sentiment
1 Accord Metropolitan Excellent comfortableness during stay 3
2 Accord Metropolitan Excellent comfortableness during stay 3
3 Accord Metropolitan Excellent comfortableness during stay 3
4 Accord Metropolitan Excellent comfortableness during stay 3
5 Accord Metropolitan Excellent comfortableness during stay 3
6 Accord Metropolitan Not too comfortable 1
Rating_Percentage X X.1 X.2 X.3 word
1 100 NA NA NA nice
2 100 NA NA NA stay
3 100 NA NA NA business
4 100 NA NA NA tourist
5 100 NA NA NA purpose
6 20 NA NA NA hotel
I have also used a simplified version of the General Inquirer Dictionary spreadsheet:
df <- read.csv("ib.csv", header=T, stringsAsFactors=FALSE)
dat <-subset(df, select=c(2,1))
head(dat)
word Scoree
1 A
2 ABANDON Negativ
3 ABANDONMENT Negativ
4 ABATE Negativ
5 ABATEMENT
6 ABDICATE Negativ
I have tried to do an inner_join, which is where I encounter this error:
observation<- hotel %>%
+ inner_join(dat, by = "word") %>%
+ count(Scoree)

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored in a game, and also contains that player's country. I would like the data in a form such that I can run pairwise correlations to see whether being from the same country is associated with the number of goals a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>%
  group_by(Player, Country) %>%
  dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on Country. So that we don't get each pair twice (Dempsey/Tim as well as Tim/Dempsey, not to mention Dempsey/Dempsey), we subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case; a sketch of the dplyr version is below anyway.
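For reference, that dplyr-only self-join might look something like this (a sketch; df_copy, Player2 and Goals2 are just illustrative names):
library(dplyr)
# a renamed copy so the player/goal columns are not used as join keys
df_copy <- df %>% rename(Player2 = Player, Goals2 = Goals)
pairs <- df %>%
  inner_join(df_copy, by = "Country") %>%                 # pair up players within a country
  filter(as.character(Player) < as.character(Player2))    # drop self-pairs and duplicates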
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
A <- aggregate(df$Goals ~ df$Player + df$Country, data = df, sum)
players_in_c <- table(A[, 2])
dat <- NULL
for (i in levels(df$Country)) {
  count <- players_in_c[i]
  pair <- combn(count, m = 2)
  B <- A[A[, 2] == i, ]
  dat <- rbind(dat, cbind(B[pair[1, ], ], B[pair[2, ], ]))
}
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0
