Join and merge only partially work in R

I am working with a dataset that was obtained from a global environment and loaded into R. It has been saved as a CSV and is read into R as a data frame from that CSV. This dataset (survey_df) has almost 3 million entries. I am trying to join it, based on an ID column (each id is repeated, since there are multiple entries per id), to what was originally a shapefile and is now loaded in R as a data frame, shapefile_df. That data frame has 60,000 unique entries, each representing a geometry in a country, and we expect many entries per geometry in most cases. I am using a simple left_join, which should in theory join these two datasets. The issue is that they are not fully joining: only some entries match. I have tried inner, full, and right joins as well as merge, and I keep getting the same result. I ran a full_join with a copy of the id columns to compare the ids that are not joining, and I do not see any pattern. The ids appear identical, yet they do not join. I tried converting them with as.character and as.factor, with no change. Below is a sample of the joined and unjoined rows.
Matched ids
| survey_df_id | survey_id_copy | shapefile_df_id |
| ------------- | ------------- | ------------- |
| 0901200010229 | 0901200010229 | 0901200010229 |
| 0901500010729 | 0901500010729 | 0901500010729 |
| 090050001087A | 090050001087A | 090050001087A |
| 0900600010467 | 0900600010467 | 0900600010467 |
| 0901400010897 | 0901400010897 | 0901400010897 |
| 0901200011960 | 0901200011960 | 0901200011960 |
Unmatched ids
| survey_df_id | survey_id_copy | shapefile_df_id |
| ------------- | ------------- | ------------- |
| 01903900010480 | 01903900010480 | NA |
| 070470001010A | NA | 070470001010A |
| 0704700010117 | NA | 0704700010117 |
| 0704700010140 | NA | 0704700010140 |
| 0705200010672 | NA | 0705200010672 |
| 0705200010742 | NA | 0705200010742 |
Most of the unmatched entries are like the first row, where shapefile_df_id is NA. However, there are a few where survey_id_copy is NA. This field is simply a mutate of survey_df_id and in theory should not differ from it, yet it does. Any idea what could be causing this? I suspect a formatting issue but, as I said, converting with as.character and as.factor hasn't fixed it. I am using tidyverse and read.csv. Any help?
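One way to test the formatting hypothesis is to compare the key columns character by character before joining. The sketch below is only an illustration under assumptions: the two small data frames are made-up stand-ins for survey_df and shapefile_df, and the trailing space is an invented defect, not something confirmed in the real data. In the sample above, the first unmatched id has 14 characters while the matched ids have 13, so checking lengths with nchar may be a reasonable first step. Note also that in a full join, rows that exist only in shapefile_df carry NA in every survey_df column, which would produce NA in survey_id_copy without any formatting problem.

```r
# Hypothetical stand-ins for survey_df and shapefile_df. The trailing space
# in the first survey id is an assumed defect used to illustrate the check.
survey_df <- data.frame(
  id = c("0901200010229 ", "0704700010117", "090050001087A", "0901200010229 "),
  answer = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)
shapefile_df <- data.frame(
  id = c("0901200010229", "0704700010117", "090050001087A"),
  region = c("north", "south", "east"),
  stringsAsFactors = FALSE
)

# Ids present in the survey but absent from the shapefile table.
unmatched_before <- setdiff(survey_df$id, shapefile_df$id)

# Ids that look identical on screen can still differ in length.
unmatched_lengths <- nchar(unmatched_before)

# Normalize both key columns: force character, then trim whitespace.
clean_id <- function(x) trimws(as.character(x))
survey_df$id <- clean_id(survey_df$id)
shapefile_df$id <- clean_id(shapefile_df$id)

unmatched_after <- setdiff(survey_df$id, shapefile_df$id)

# Left join: keep every survey row, attach shapefile columns where ids match.
joined <- merge(survey_df, shapefile_df, by = "id", all.x = TRUE)
```

If the ids contain leading zeros, it may also help to read the CSV with read.csv(..., colClasses = "character") so that purely numeric ids are not converted to numbers on import.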

Related

Filter information in a data frame according to multiple variables in R

The filter function in R lets you filter rows, usually on one or two values. However, I want to filter rows according to the values of some variables (two, in my case) from another data frame. Here is an example of what I have and what I need.
Main data frame:
df_root<-data.frame('id'=c('1a','1a','2a','2a'),
'zone'=c('II', 'I', 'I', 'II'),
'date'=c(1,6,1,5))
Alternative data frame:
df_alternative<-data.frame('id'=c('1a', '2a'),
'date'=c(6,5))
So, I need to filter the first data frame using the information in the second, according to the variables df_alternative$id and df_alternative$date. The result must be as follows.
df_root_filtered:
| id | zone | date|
| ---- | ---- |-----|
| 1a | I | 6 |
| 2a | II | 5 |
Thanks for your help
It sounds like you want a join, so that you keep the rows from df_root that have corresponding rows in df_alternative, correct? If so, you can use the merge function. With all.y = TRUE, the call below is a right join that matches on the shared columns id and date.
merged_tables <- merge(df_root, df_alternative, all.x = FALSE, all.y = TRUE)
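As a minimal sketch using the data from the question, the same filtering can be written as an inner join with both key columns named explicitly. Naming the keys in by and restoring the column order are choices made here for illustration, not part of the original answer.

```r
# Data frames copied from the question.
df_root <- data.frame(
  id = c("1a", "1a", "2a", "2a"),
  zone = c("II", "I", "I", "II"),
  date = c(1, 6, 1, 5),
  stringsAsFactors = FALSE
)
df_alternative <- data.frame(
  id = c("1a", "2a"),
  date = c(6, 5),
  stringsAsFactors = FALSE
)

# Inner join on both keys: keep only rows of df_root whose (id, date)
# pair also appears in df_alternative.
df_root_filtered <- merge(df_root, df_alternative, by = c("id", "date"))

# merge() puts the key columns first, so restore the original column order.
df_root_filtered <- df_root_filtered[, c("id", "zone", "date")]
```

With tidyverse loaded, dplyr::semi_join(df_root, df_alternative, by = c("id", "date")) should give the same rows without adding any columns from df_alternative.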

My R data frame has 2 separate tables that sit in the same columns, but the column names are not the same. How can I subset/extract the bottom table?

I am loading a CSV file into RStudio that contains two separate tables occupying the same columns, but the second table has a different field (column name) that I need to split off and add to the first table. The CSV is structured similarly to the sample below.
#1| Date | Figure1| Figure2|
----------------------------
#2|1/1/20| 10 | 15 |
#3|1/8/20| 20 | 25 |
...
...
#56| Date | Figure3|
--------------------
#57|1/1/20| 18 |
#58|1/8/20| 16 |
I need a way for R to read all of the rows up to the occurrence of "Total3" in the second column and put them into their own data frame (df1), and to read all of the rows after the occurrence of "Total3" in the second column into a separate data frame (df2), so that I can merge these data frames into one single table. The CSV that I am pulling is updated every week, and rows are added to both tables, so I cannot hard-code the row numbers. Ultimately I need it to look something like this.
#1| Date | Figure1| Figure2| Figure3|
-------------------------------------
#2|1/1/20| 10 | 15 | 18 |
#3|1/8/20| 20 | 25 | 16 |
I have looked into using stringr's str_extract, but have not been able to make it work for my case. Thank you for any help.
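One possible approach is to read the whole file without a header, find the row where the second table's header appears, and split there instead of relying on fixed row numbers. The sketch below is an illustration under assumptions: csv_text is a made-up stand-in for the weekly file, make_table is a hypothetical helper, and the marker is set to "Figure3" to match the sample layout (the question text calls it "Total3").

```r
# Hypothetical CSV content standing in for the weekly file.
csv_text <- c(
  "Date,Figure1,Figure2",
  "1/1/20,10,15",
  "1/8/20,20,25",
  "Date,Figure3,",
  "1/1/20,18,",
  "1/8/20,16,"
)

# Read everything as character and without a header, so that both tables
# (including their header rows) end up as ordinary rows.
raw <- read.csv(text = csv_text, header = FALSE, colClasses = "character")

# Locate the header row of the second table by its marker in column 2.
marker <- "Figure3"  # assumption: use whatever label starts the second table
split_row <- which(raw[[2]] == marker)[1]

# Hypothetical helper: promote the first row of a block to column names
# and drop columns whose name is empty.
make_table <- function(block) {
  header <- as.character(unlist(block[1, ]))
  out <- block[-1, , drop = FALSE]
  names(out) <- header
  out <- out[, !is.na(header) & header != "", drop = FALSE]
  rownames(out) <- NULL
  out
}

df1 <- make_table(raw[seq_len(split_row - 1), ])
df2 <- make_table(raw[split_row:nrow(raw), ])

# Combine the two tables on the shared Date column.
combined <- merge(df1, df2, by = "Date")
```

Because the split position is found by searching for the marker, rows added to either table in later weeks should not require any change. The columns are still character at this point; type.convert or as.numeric can be applied afterwards if numbers are needed.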

Plotting mean-values of elements with same frequencies

I have two columns in my Google Sheet that correspond to 1) the frequencies of the elements and 2) their respective 'values'.
What I want is a chart with the different frequencies on the x-axis and, for each frequency, that frequency's value on the y-axis (if more than one element has that frequency, I want it to plot their mean value).
Two elements can share the same frequency and/or the same score, which is why I want the mean functionality as well.
If the following data were my values:
280 6
280 4
250 2
240 1
230 3
Forgive my ascii-skills, but I'd want the graph to plot the following in that case:
^
.
.
|
|                                 |
|                                 |
|   |                             |
|   |           |                 |
|   |     |     |                 |
___230___240___250___260___270___280___...>
I'm not entirely familiar with Google Sheets yet and I'm not really sure how to accomplish this.
I think a pivot table should serve. Put your left-hand column in Rows and your right-hand column in Values, with Summarise by AVERAGE. Then chart the results (select what in the image is B14:B17, Insert..., Chart and accept the first recommendation).
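To check what the pivot table should return for the sample data, the same group-and-average step can be sketched in R. This is only a cross-check outside Google Sheets; the column names frequency and value are assumptions.

```r
# Sample data from the question: frequency in the first column, value in the second.
sheet <- data.frame(
  frequency = c(280, 280, 250, 240, 230),
  value = c(6, 4, 2, 1, 3)
)

# Mean value per distinct frequency, the equivalent of a pivot table
# with frequency in Rows and value summarised by AVERAGE.
means <- aggregate(value ~ frequency, data = sheet, FUN = mean)
```

The two elements at frequency 280 average to 5, which matches the tallest bar in the sketch from the question.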

Filter on a Query Item

I am creating a report and have a field that holds multiple values representing different data values, e.g. 4-Completeness, 5-Accuracy, etc. What I need to do is make multiple columns where that field is filtered down to one value. The problem is that if I try to edit the query item in the report, I get the error 'Boolean value expression as query item is not supported'. How do I fix this?
example:
ID column | Data Value = 4 | Actual Data | Data Value = 5
EDIT:
I currently use case when [Data value] = 4 then [percentage] for each of the different columns, but I am still getting the wrong output. I am getting:
ID1 | 45% | | |
ID1 | | 35% | |
ID1 | | | 67% |
I need all of ID1 to be in one row.
You can fix this by totaling by ID, which will combine all three rows in your example into one:
total([Measure] for [ID])
Change each of the three percentage columns to use this expression, substituting their respective data item for [Measure].
Normally, you don't want to total percentages, but this is an exception. Since only one row has actual data, the total will match that row and the other two null values will not be included in the total.
A simple way would be to build one query for each data value and join the three queries on the ID column.

Selenium IDE, identify row in table based on 3 columns

I am trying to find a row in a table which contains specific values on three columns.
I have tried the methods in #paul trmbrth's answer to find an XPath that identifies a cell in a table based on another column. They worked fine for 2 columns, but didn't work with 3. I didn't find any example for cases with more than 2 values.
VEHICLE CATEGORY | CATEGORY | SUBCATEGORY
A | Exteriors | Badges
A | Exteriors | Badges
A | Exteriors | Mirrors
A | Interiors | Wheels
A | Interiors | Rears
I want the cell in the row with this combination:
A | Exteriors | Mirrors
I have tried the following without success:
//tr[contains(td[1], 'A')]/td[2][contains(., 'Exterior')] td[3][contains(., 'Mirror')]
//tr[contains(td[1], 'A')]/td[2][contains(., 'Exterior')] /td[3][contains(., 'Mirror')]
css=tr([td:contains('A')][td:contains('Exterior')][td:contains('Mirror')])
Can anyone help?
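I think you have a couple of typos. Combine the three conditions in a single predicate on the row:
//tr[contains(td[1], '1') and contains(td[2], 'Eve') and contains(td[3], 'Jackson')]
Substitute your own values, for example 'A', 'Exterior' and 'Mirror'. I'm not 100% sure this is the most efficient approach, but it will work.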
