I want to take the nth row of a dataframe, compare it to every other row, and return how many of the columns match and/or mismatch.
I tried the match function and ifelse for single pairs of rows, but I haven't been able to extend either approach to the entire dataframe.
The Superstore dataset contains the order priority, customer name, ship mode, customer segment, and product category. It looks like this:
> head(df2)
  Order.Priority     Customer.Name      Ship.Mode Customer.Segment Product.Category
1  Not Specified       Dana Teague    Regular Air        Corporate  Office Supplies
2       Critical     Vanessa Boyer    Regular Air         Consumer  Office Supplies
3       Critical       Wesley Tate    Regular Air        Corporate       Technology
4           High       Brian Grady Delivery Truck        Corporate        Furniture
5         Medium Kristine Connolly Delivery Truck        Corporate        Furniture
6           High       Emily Britt    Regular Air        Corporate  Office Supplies
The code I tried (extracting relevant columns):
df <- read.csv("Superstore.csv", header = TRUE)
df2 <- df[,c(2,4,5,6,7)]
match(df2[2,],df2[1,],nomatch = 0)
This returns:
> match(df2[2,],df2[1,],nomatch = 0)
[1] 0 0 3 0 5
Using ifelse I get:
> ifelse(df2[1,]==df2[2,],1,0)
  Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1              0             0         1                0                1
Like I said, this is exactly the result I need for a single pair of rows, but I haven't been able to replicate it for the whole dataframe.
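One possible extension (a minimal sketch, assuming df2 as above; note that apply() coerces each row to character, so the comparison is by printed value):
n <- 1                                    # the row to compare against
target <- sapply(df2[n, ], as.character)  # row n as a character vector
# for every other row, count how many columns agree with row n
match_counts <- apply(df2[-n, ], 1, function(r) sum(r == target))
mismatch_counts <- ncol(df2) - match_counts
head(match_counts)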
This is my first question on stackoverflow, so feel free to criticize the question.
For every row in a data set, I would like to sum team_points over the rows that:
have identical 'team', 'season' and 'simulation_ID'.
have 'match_ID' smaller than (and not equal to) the current 'match_ID'.
such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).
I am struggling to implement the second condition without using an extremely slow for-loop.
The data looks like this:
match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool  Draw          1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley    Draw          1
My current solution is:
simulation$accumulated_points <- 0
for (row in 1:nrow(simulation)) {
  simulation$accumulated_points[row] <-
    sum(simulation$team_points[simulation$season == simulation$season[row] &
                                 simulation$match_ID < simulation$match_ID[row] &
                                 simulation$simulation_ID == simulation$simulation_ID[row] &
                                 simulation$team == simulation$team[row]], na.rm = TRUE)
}
This works, but it is obviously too slow to use on large data sets. I cannot figure out how to speed it up. What is a good solution here?
For loops are usually slow in interpreted languages like R and are best avoided when a vectorized alternative exists. "Vectorized operations" apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages often rely on optimized C/C++ code and linear algebra libraries under the hood, so a single vectorized call avoids the per-element interpreter overhead of an R loop and can additionally exploit CPU-level parallelism that processes many vector elements at once. You can find more information about vectorization in this question.
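As a toy illustration of the difference (not your problem, just the principle):
x <- runif(1e6)

# loop: one interpreted R call per element
y_loop <- numeric(length(x))
for (i in seq_along(x)) y_loop[i] <- x[i] * 2

# vectorized: a single call into optimized native code
y_vec <- x * 2

identical(y_loop, y_vec)  # TRUE, but the vectorized version is much faster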
In your specific case, you could use dplyr to transform your data:
library(dplyr)

simulation %>%
  # you want to perform the same operation for each of these groups
  group_by(team, season, simulation_ID) %>%
  # within each group, order the data by match_ID (ascending)
  arrange(match_ID, .by_group = TRUE) %>%
  # take the team_points vector in each group and calculate its cumulative sum;
  # lag() shifts it down one row so the current match is excluded, as your
  # second condition requires; write the result into a new column "points"
  mutate(points = lag(cumsum(team_points), default = 0))
The code above essentially decomposes the team_points column into one vector per group, then applies a single, highly optimized operation to each of them.
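If you prefer data.table, a sketch of the same logic (assuming the same column names) would be:
library(data.table)
setDT(simulation)                # convert to a data.table by reference
setorder(simulation, match_ID)   # sort by match_ID (ascending)
# cumulative points within each group, excluding the current match
simulation[, points := cumsum(team_points) - team_points,
           by = .(team, season, simulation_ID)]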
I want to keep all rows where the "conm" column contains certain bank names. As you can see from the code below, I have tried using subset to do this, but to no avail.
CMPSTPRFT12 <- subset(CMPSPRFT11, conm = MORGUARD CORP | conm = LEHMAN BROTHERS HOLDINGS INC)
I expect the output in RStudio to show only the rows where the column containing the bank names includes certain banks, not all of them. I want SunTrust, Lehman Brothers, Morgan Stanley, Goldman Sachs, PennyMac, Bank of America, and Fannie Mae.
Please see other posts on how to phrase your questions more helpfully for others: How to make a great R reproducible example.
You can use dplyr and filter.
library(dplyr)

df <- data.frame(bank = letters[1:10],
                 value = 10:19)
df %>% filter(bank == 'a' | bank == 'b')
  bank value
1    a    10
2    b    11
banks <- c('d','g','j')
df %>% filter(bank %in% banks)
  bank value
1    d    13
2    g    16
3    j    19
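Applied to your data, that becomes something like the sketch below (extend banknames with the other banks you listed, spelled exactly as they appear in the conm column):
banknames <- c("MORGUARD CORP", "LEHMAN BROTHERS HOLDINGS INC")

# base R
CMPSTPRFT12 <- subset(CMPSPRFT11, conm %in% banknames)

# or dplyr
CMPSTPRFT12 <- CMPSPRFT11 %>% filter(conm %in% banknames)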
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
4
4
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to group them all at once.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common; their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow).
sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
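For instance, you can probe individual pairs to see how the metrics behave (a quick sketch; the numbers are not comparable across metrics, so experiment before committing to one):
# LCS counts the characters the two strings do not share, so the long
# official name is penalized heavily despite containing "dow"
stringdist("dow", "the dow chemical company", method = "lcs")
# Jaro-Winkler and cosine-on-qgrams are bounded in [0, 1] and weigh
# shared substrings differently
stringdist("dow", "the dow chemical company", method = "jw")
stringdist("dow", "the dow chemical company", method = "cosine", q = 2)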
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
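And if all you need are the numeric group labels from your second desired output, you can cut the hierarchical tree directly (a sketch; k = 4 matches the four groups in your example, but you will want to tune k on real data):
groups <- cutree(hclust(sdm_dist), k = 4)
data.frame(CompanyName, group = groups)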
I am working with a dataframe in R that has three columns: House, Appliance, and Count. The data is essentially an inventory of the different types of kitchen appliances contained within each house on a block. The data look something like this: (spaces added for illustrative purposes)
House  Appliance     Count
1      Toaster       2
2      Dishwasher    1
2      Toaster       1
2      Refrigerator  1
2      Toaster       1
3      Dishwasher    1
3      Oven          1
For each appliance type, I would like to be able to compute the proportion of houses containing at least one of those appliances. Note that in my data, it is possible for a single house to have zero, one, or multiple appliances in a single category. If a house does not have an appliance, it is not listed in the data for that house. If the house has more than one appliance, the appliance could be listed once with a count >1 (e.g., toasters in House 1), or it could be listed twice (each with count = 1, e.g., toasters in House 2).
As an example showing what I am trying to compute, in the data shown here, the proportion of houses with toasters would be .67 (rounded) because 2/3 of the houses have at least one toaster. Similarly, the proportion of houses with ovens would be 0.33 (since only 1/3 of the houses have an oven). I do not care that any of the houses have more than one toaster -- only that they have at least one.
I have fooled around with xtabs and ftable in R but am not confident that they provide the simplest solution. Part of the problem is that these functions will provide the number of appliances for each house, which then throws off my proportion of houses calculations. Here's my current approach:
temp1 <- xtabs(~House + Appliance, data=housedata)
temp1[temp1[,] > 1] <- 1 # This is needed to correct houses with >1 unit.
proportion.of.houses <- data.frame(margin.table(temp1, 2) / 3)  # 3 = number of houses, hardcoded
This appears to work but it's not elegant. I'm guessing there is a better way to do this in R. Any suggestions much appreciated.
library(data.table)
setDT(df)  # convert df to a data.table by reference

n.houses = length(unique(df$House))  # total number of houses
# for each appliance, the share of houses owning at least one
df[, length(unique(House)) / n.houses, by = Appliance]
library(dplyr)

n <- length(unique(df$House))  # total number of houses
df %>%
  group_by(Appliance) %>%
  summarise(freq = n_distinct(House) / n)  # share of houses with at least one
Output:
     Appliance      freq
1   Dishwasher 0.6666667
2         Oven 0.3333333
3 Refrigerator 0.3333333
4      Toaster 0.6666667
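For completeness, a base-R version of your own xtabs idea without the hardcoded house count (a sketch, assuming the data frame is named df as in the answers above):
# table(...) > 0 flags which houses have at least one of each appliance;
# colMeans then gives the proportion of houses per appliance
colMeans(table(df$House, df$Appliance) > 0)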
I'm working with two Excel files. One is a very large (~5 to 6 GB) data set from the government's Open Payments Data, which is free and open for everyone to view. It lists all of the disclosed payments from industry to physicians.
The second Excel file is also large; it lists the disclosed payments for physicians at a particular institution.
My goals:
I'd like to 'filter' the Open Payments Data down to just the physicians in my second Excel file. Is there any way to do that? The Open Payments Data is inconsistent, with a mix of uppercase and lowercase.
What I've done so far:
I've been able to parse out the Open Payments Data to just include the state of the physicians I'm looking for. I've also imported both of these .csv files into R and named them accordingly.
I'm taking a course in R right now but it's been no help ... and most of the answers I've found online are for smaller sets of data. The data I'm working with has ~500,000 rows! Thank you in advance for your insight.
Edit: This is head(mydata)
Physician_Profile_ID Physician_First_Name
1 377519 KELLI
2 377519 KELLI
3 377519 KELLI
4 272641 ABDUL
5 272641 ABDUL
6 272641 ABDUL
Physician_Middle_Name Physician_Last_Name
1 A AABY
2 A AABY
3 A AABY
4 A AADAM
5 A AADAM
6 AADAM
Physician_Name_Suffix
1
2
3
4
5
6
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 APOLLO ENDOSURGERY INC
5 APOLLO ENDOSURGERY INC
6 BOSTON SCIENTIFIC CORPORATION
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 Apollo Endosurgery Inc
5 APOLLO ENDOSURGERY INC
6 Boston Scientific Corporation
Total_Amount_of_Payment_USDollars Date_of_Payment
1 11.55 6/17/2014
2 187.50 6/4/2014
3 222.24 5/23/2014
4 60.20 5/4/2014
5 110.15 7/28/2014
6 12.36 12/10/2014
Form_of_Payment_or_Transfer_of_Value
1 In-kind items and services
2 In-kind items and services
3 In-kind items and services
4 In-kind items and services
5 In-kind items and services
6 In-kind items and services
Nature_of_Payment_or_Transfer_of_Value City_of_Travel
1 Food and Beverage
2 Gift
3 Education
4 Food and Beverage
5 Food and Beverage
6 Food and Beverage
State_of_Travel Country_of_Travel
1
2
3
4
5
6
And this is head(institution_data, 2):
DB.ID Last.Name First.Name
1 12345 Johnson John
2 12354 Twain Mark
Names have been changed for confidentiality. DB ID != Physician_ID unfortunately.
A list (vector actually) of physician IDs could be constructed:
PHY_ID <- unique(
  institution_data$DB.ID[institution_data$DB.ID %in% mydata$Physician_Profile_ID])
Then extract the data from the main file using the matches to that vector:
chargedata <- mydata[ mydata$Physician_Profile_ID %in% PHY_ID , ]
You could also use match with the same logic, but the %in% function uses match "under the hood" and code written with %in% is generally easier to read. If the IDs were not supposed to match (which you should have stated if that were the case), then name matching could be attempted, but it would make sense to add additional criteria, such as state or nearby zip code.
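A sketch of that name-based fallback, using the column names from the head() output above and normalizing case (an assumption: the names match apart from capitalization):
# build a comparable "LAST FIRST" key in both tables
key_main <- toupper(paste(mydata$Physician_Last_Name,
                          mydata$Physician_First_Name))
key_inst <- toupper(paste(institution_data$Last.Name,
                          institution_data$First.Name))
chargedata <- mydata[key_main %in% key_inst, ]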