Cleaning duplicate names that have certain extensions - r

In my data table's company name column, some companies appear repeatedly under slightly different names, e.g. Apple and Apple_Do Not Call. I want to keep only one of them. How do I clean this data? The repeated company names have the same values in the other fields.
Company Name       Volume
Apple              150
Wallmart           190
Apple_Do Not Call  150
Sapient            450
Apple inc.         150
If you eyeball the data, Apple appears repeatedly under different names. I want to keep a single value only, i.e. Apple.

You can group_by on another field that has the same value across the duplicates (Volume in this case) and then use mutate to set Company_Name to the first value within each group:
dt %>% group_by(Volume) %>% mutate(Company_Name = first(Company_Name))
dt here would be your data.table
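A minimal runnable sketch of that approach, built from the example data above (dplyr assumed; note that it relies on the question's premise that the duplicates share the same Volume):
library(dplyr)

dt <- tibble::tribble(
  ~Company_Name,       ~Volume,
  "Apple",             150,
  "Wallmart",          190,
  "Apple_Do Not Call", 150,
  "Sapient",           450,
  "Apple inc.",        150
)

dt %>%
  group_by(Volume) %>%
  mutate(Company_Name = first(Company_Name)) %>%
  ungroup()
# every Apple variant now carries "Apple", the first Company_Name in its Volume group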

Related

R xml2: How to query only corresponding XML nodes

I'm trying to read and transform many XML files into R data frames (or preferably tibbles).
All the R packages I've tried (XML, flatxml, xmlconvert) unfortunately failed when I tried to convert the files using their built-in functions (e.g. xmlToDataFrame from the XML package and xml_to_df from the xmlconvert package), so I have to do it manually with xml2.
Here is my question with a small working example:
# Minimal Working Example
library(tidyverse)
library(xml2)
interimxml <- read_xml("<Subdivision>
<Name>Charles</Name>
<Salary>100</Salary>
<Name>Laura</Name>
<Name>Steve</Name>
<Salary>200</Salary>
</Subdivision>")
names <- xml_text(xml_find_all(interimxml, "//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml, "//Subdivision/Salary"))
names
salary
# combine into a tibble (doesn't work because of unequal vector lengths)
result <- tibble(names = names,
                 salary = salary)
result
rbind(names, salary)
From the (made-up) XML file you can see that Charles earns 100 dollars, Laura earns nothing (because of the missing entry; this is the problem) and Steve earns 200 dollars.
What I want xml2 to do when querying the name and salary nodes is to return an NA (or zero, which would also be fine) whenever it finds a name but no corresponding salary entry, so that I end up with a nice table like this:
Name     Salary
Charles  100
Laura    NA
Steve    200
I know that I could modify the XPath to pick up only the last value (for Steve), but that wouldn't help me, since (in the real data) it could just as well be the 100th or the 23rd person whose salary information is missing.
[I'm aware that the salary numbers are pulled from the XML file as character values. I would mutate(across(salary, as.double)) over the columns afterwards.]
Any help is highly appreciated. Thank you very much in advance.
You need to be a bit more careful to match up the names and salaries. Basically, first find all the <Name> nodes, then check whether each node's next sibling is a <Salary> node. If not, return NA.
nameNodes <- xml_find_all(interimxml, "//Subdivision/Name")
names <- xml_text(nameNodes)
salary <- map_chr(nameNodes, ~ xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))
tibble::tibble(names, salary)
# names salary
# <chr> <chr>
# 1 Charles 100
# 2 Laura NA
# 3 Steve 200
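To get numeric salaries afterwards, as the asker plans, the result can be wrapped up along these lines (a small follow-up sketch, with the tidyverse already loaded above):
result <- tibble::tibble(names, salary) %>%
  mutate(salary = as.double(salary))
result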

How to display a data frame variable's value in a string of another data frame in R?

I have created a new data frame column whose text contains variable references taken from another table.
This is my df1 data frame:
coltext
------
df[df$ID=="1234",'Name'] bought the expensive product df[df$ID=="1234",'price']
df[df$ID=="231",'Name'] bought the leather product
df[df$ID=="4321",'Name'] bought the spareparts
df[df$ID=="4568",'Name'] bought the expensive product
My df data frame has ID, Name and price:
ID Name price
1234 Rick 333
4568 Jim 555
231 Rex 122
I want to print my df1 column coltext with the variable values substituted in, like this:
1. Rick bought the expensive product 333
My code:
for (i in 1:nrow(df1)) {
  print(df1[i, 1])
}
but I'm getting the same string back without the values:
df[df$ID=="1234",'Name'] bought the expensive product df[df$ID=="1234",'price']
Is there a way to use the values in place of the R code in the string?
Try the glue package.
One more thing: either use " inside ' ', or ' inside " ", but don't mix them.
Use either
df[df$ID=="1234","Name"] bought the expensive product df[df$ID=="1234","price"]
OR
df[df$ID=='1234','Name'] bought the expensive product df[df$ID=='1234','price']
but don't use
df[df$ID=="1234",'Name'] bought the expensive product df[df$ID=="1234",'price']
library(glue)
df <- read.table(text = 'ID Name price
1234 Rick 333
4568 Jim 555
231 Rex 122', header = T)
glue('{df[df$ID=="1234","Name"]} bought the expensive product {df[df$ID=="1234", "price"]}')
#> Rick bought the expensive product 333
Created on 2021-05-26 by the reprex package (v2.0.0)
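If the goal is to evaluate every row of df1$coltext, one option is to store the templates with the R expressions wrapped in braces (which differs from how coltext is shown above) and call glue on each row. A sketch, reusing df and library(glue) from the snippet above and keeping only IDs that exist in df:
df1 <- data.frame(coltext = c(
  '{df[df$ID=="1234","Name"]} bought the expensive product {df[df$ID=="1234","price"]}',
  '{df[df$ID=="231","Name"]} bought the leather product',
  '{df[df$ID=="4568","Name"]} bought the expensive product'
))

vapply(df1$coltext, function(s) as.character(glue(s)), character(1), USE.NAMES = FALSE)
# "Rick bought the expensive product 333", "Rex bought the leather product", "Jim bought the expensive product"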

How can I merge two dataframes equivalently to VLookup criteria?

I have just recently started using R for my master's thesis. I need to match the ID numbers (uuid) of dataframe 1 to the investee names in dataframe 2.
Dataframe 1
investee_name uuid
1 Wetpaint e1393508
2 Zoho bf4d7b0e
3 Digg 5f2b40b8
4 Omidyar Network f4d5ab44
5 Facebook df662812
6 Trinity Ventures 7ca12f
Dataframe 2:
investee_name investor_name investor_type
1 Facebook cel organization
2 Facebook Grock Partners organization
3 Facebook Partners organization
4 Photobucket Ventures organization
5 Geni Fund organization
6 Gizmoz Capital organization
As you can see, in Dataframe 2 the investee names appear multiple times. With VLOOKUP in Excel I could have matched the respective IDs from Dataframe 1 easily, but for some reason the merging does not work in R.
I have tried the following:
investments_complete <- merge(v2_investments, ID_organizations, by.x= names(v2_investments)[1], by.y= names(ID_organizations)[1])
v2_investments_complete <- (merge(ID_organizations,v2_investments, by = "investee_name"))
For both options the ID columns are merged, but I get 0 observations.
Lastly, I tried this:
v2_investments_merged <- merge(v2_investments, ID_organizations, by.x = "investee_name", by.y = "investee_name", all.x= TRUE)
Here the merge works and all the needed observations are there, but all IDs have the value NA.
Is there any kind of merge function that mirrors the Vlookup that I intend to do? I've spent hours trying to solve this but couldn't, so I would be very grateful for support!
Cheers,
Philipp
It is possible that there are leading/trailing spaces in the by columns. One option is trimws from base R, which removes whitespace from both ends (if any):
v2_investments$investee_name <- trimws(v2_investments$investee_name)
ID_organizations$investee_name <- trimws(ID_organizations$investee_name)
Now, the merge should work
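A small invented reproduction of the whitespace problem, to show why the earlier merges returned 0 observations or NA IDs:
ID_organizations <- data.frame(investee_name = c("Facebook ", "Zoho"),  # note the trailing space
                               uuid = c("df662812", "bf4d7b0e"))
v2_investments <- data.frame(investee_name = c("Facebook", "Facebook"),
                             investor_name = c("cel", "Grock Partners"))

merge(v2_investments, ID_organizations, by = "investee_name")                 # 0 rows: "Facebook " != "Facebook"

ID_organizations$investee_name <- trimws(ID_organizations$investee_name)
merge(v2_investments, ID_organizations, by = "investee_name", all.x = TRUE)   # uuid is now filled in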

R: Merging two data.table while filtering for unique ID: only NA as answers

My problem is as follows:
I need to analyse data from several different files with a lot of entries (up to 500,000 per column, 10 columns in total).
The files are connected through the use of IDs, e.g. ORDER_IDs.
However, the IDs can appear multiple times, e.g. when an order contains multiple order lines. It is also possible that an ID doesn't appear in one of the files, e.g. because a file with sales data only contains information on the orders that have shipped, not on those that haven't shipped yet.
So I have different files of different lengths, with IDs identifying positions that may or may not appear in any given data file.
What I want now is to filter one file by ID so that it only shows the IDs listed in another file. Also, the additional columns from the first file should be moved over.
Example of what I have:
dt1:
ORDER_ID  SKU_ID  Quantity_Shipped
12345     678910  100
12346     648392  30
64739     648392  20
dt2:
ORDER_ID  Country
12345     DE
12346     DE
55430     SE
90632     JPN
76543     ARG
64739     CH
What I want:
ORDER_ID  SKU_ID  Quantity_Shipped  Country
12345     678910  100               DE
12346     648392  30                DE
64739     648392  20                CH
Originally, the data was imported from a csv file.
The approach I used so far has worked when merging two files.
However, when I try to add the information from a third file, I get only NAs. What can I do to fix this?
This is the approach I used so far.
df2 <- data.frame(ORDER_ID = sales[["ORDER_ID"]])
df1 <- data.frame(ORDER_ID = OL[["ORDER_ID"]],
                  SKU_ID = OL[["SKU_ID"]],
                  QTY_SHIPPED = OL[["QTY_SHIPPED"]],
                  EXPECTED_VOLUME = OL[["EXPECTED_VOLUME"]])
library(data.table)
dt2 <- data.table(df1)
dt1 <- data.table(df2)
dt3 <- dt2[match(dt1$ORDER_ID, dt2$ORDER_ID), ]
You can use either the join built into data.table's bracket syntax, or the explicit merge command (which also dispatches to a data.table S3 method, but that doesn't matter much here).
dt2[dt1, on = "ORDER_ID"]
# ORDER_ID Country SKU_ID Quantity_Shipped
# 1: 12345 DE 678910 100
# 2: 12346 DE 648392 30
# 3: 64739 CH 648392 20
merge(dt1, dt2, by = "ORDER_ID")
Sometimes I prefer the clarity of the merge call, in that I control left/right and other aspects (if the default bracket join above doesn't do what I want). I found https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html to be a good reference for left/right and other join types.
One case where merge will not suffice is range joins, done either with data.table::foverlaps or with an extension of the bracket method:
# with mythical data, joining where dat1$val falls between dat2$val1 and dat2$val2
dat1[dat2, on = .(val >= val1, val <= val2)]
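Since that last snippet uses mythical data, here is a self-contained sketch of such a range join with invented tables:
library(data.table)

dat1 <- data.table(id = 1:3, val = c(5, 15, 25))
dat2 <- data.table(grp = c("low", "high"), val1 = c(0, 10), val2 = c(10, 30))

# for each row of dat2, return the dat1 rows whose val lies between val1 and val2
dat1[dat2, on = .(val >= val1, val <= val2)]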

Extracting a value based on multiple conditions in R

Quick question: I have a data frame (severity) that looks like this:
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
If I wanted to pull the relfreq of Medical Devices (row 5), how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user specifies the industry, i.e.
example <- function(industrytype) {
  weight <- relfreq of industrytype parameter  # pseudocode
  thing2 <- thing1 * weight
  return(thing2)
}
So if I subset by an index, is there a way for R to know which index corresponds to the industry type specified in the function parameter? Or is it easier to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you asked about (industryType and relfreq).
The tidyverse packages let you do this intuitively: library(tidyverse)
data_want <- severity %>%
  subset(industryType == "Medical Devices") %>%
  select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step on to the next, as if nesting.
I think selecting the whole row first and then picking the column you want to see is better:
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
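Plugging either lookup back into the asker's function skeleton could look like this (thing1 is a placeholder carried over from the question, so a stand-in value is used here):
thing1 <- 10  # stand-in value; the question does not define thing1

example <- function(industrytype) {
  weight <- severity$relfreq[severity$industryType == industrytype]
  thing2 <- thing1 * weight
  return(thing2)
}

example("Medical Devices")  # 10 * 1.463415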
