Find all largest values in a range, across different objects in data frame - r

I wonder if there is a simpler way than writing if...else... for the following case. I have a data frame and I only want the rows whose value in column "percentage" is >= 95. Moreover, for each object, if multiple rows fit this criterion, I only want the largest one(s). If there is more than one largest, I would like to keep all of them.
For example:
object city street     percentage
A      NY   Sun        100
A      NY   Malino      97
A      NY   Waterfall  100
B      CA   Washington  98
B      WA   Lieber      95
C      NA   Moon        75
Then I'd like the result to show:
object city street     percentage
A      NY   Sun        100
A      NY   Waterfall  100
B      CA   Washington  98
I am able to do it using if...else statements, but I feel there should be a smarter way to say: 1. keep rows >= 95; 2. if there is more than one per object, choose the largest; 3. if there is more than one largest, keep them all.

You can do this by creating a variable that indicates the rows that have the maximum percentage for each object. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are greater than 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works because the combined condition produces a logical vector, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
                                     FUN=function(i) i==max(i)), ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98

You could also do this in data.table, using the same approach as @user20650
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr
library(dplyr)
dat %>%
  group_by(object) %>%
  filter(percentage == max(percentage) & percentage >= 95)
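If you are on dplyr 1.0 or later, slice_max() expresses the same idea a little more directly. A sketch, assuming the same dat as above; with_ties = TRUE (the default) keeps all tied maxima:
library(dplyr)
dat %>%
  group_by(object) %>%
  # criterion 1: keep rows >= 95
  filter(percentage >= 95) %>%
  # criteria 2 and 3: the largest per object, keeping all ties
  slice_max(percentage, with_ties = TRUE)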

The following works:
# keep only rows meeting the >= 95 criterion
ddf2 = ddf[ddf$percentage >= 95, ]
# an empty copy of ddf2 to collect the results
ddf3 = ddf2[-c(1:nrow(ddf2)), ]
for (oo in unique(ddf2$object)) {
  tempdf = ddf2[ddf2$object == oo, ]
  # keep only the rows with this object's maximum percentage
  maxval = max(tempdf$percentage)
  tempdf = tempdf[tempdf$percentage == maxval, ]
  for (i in 1:nrow(tempdf)) ddf3[nrow(ddf3) + 1, ] = tempdf[i, ]
}
ddf3
object city street percentage
1 A NY Sun 100
2 A NY Waterfall 100
3 B CA Washington 98
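A more compact base-R version of the same loop, as a sketch (assuming ddf holds the example data): split the filtered frame by object, keep each piece's maxima, and bind the pieces back together.
ddf2 <- ddf[ddf$percentage >= 95, ]
# drop = TRUE avoids empty pieces if object is a factor with unused levels
ddf3 <- do.call(rbind, lapply(split(ddf2, ddf2$object, drop = TRUE),
                              function(d) d[d$percentage == max(d$percentage), ]))
ddf3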

Related

How to make network edges more dynamic in a network plot?

I have data which represents transit between UK cities.
Transit: 1 if there is transit between the two cities, otherwise 0
ave.pas: average number of passengers
library(plotly)
library(ggraph)
library(tidyverse)
library(tidygraph)
library(igraph)
library(edgebundleR)
df2 <- data.frame(City1 = c("London", "London", "London", "London", "Liverpool",
                            "Liverpool", "Liverpool", "Manchester", "Manchester", "Bristol"),
                  City2 = c("Liverpool", "Manchester", "Bristol", "Derby", "Manchester",
                            "Bristol", "Derby", "Bristol", "Derby", "Derby"),
                  Transit = c(1, 0, 1, 1, 1, 1, 1, 1, 0, 1),
                  ave.pas = c(10, 0, 11, 24, 40, 45, 12, 34, 0, 29))
df2:
City1 City2 Transit ave.pas
1 London Liverpool 1 10
2 London Manchester 0 0
3 London Bristol 1 11
4 London Derby 1 24
5 Liverpool Manchester 1 40
6 Liverpool Bristol 1 45
7 Liverpool Derby 1 12
8 Manchester Bristol 1 34
9 Manchester Derby 0 0
10 Bristol Derby 1 29
Now I plot the circular network:
df <- subset(df2, Transit== 1, select = c("City1","City2"))
edgebundle(graph.data.frame(df),directed=F,tension=0.1,fontsize = 10)
My goal is to set the size or colour intensity of the edges based on the corresponding value of the 'ave.pas' variable in the dataset.
Linked: link1 link2 link3 link4
(The plot must be made using the edgebundle() function.)
The intensity of the edges in the linked plots appears to be a function of the number of edges joining the vertices. We can make the number of edges equal to the number of passengers, but the problem here is that after a few lines are plotted on top of each other, the intensity stops increasing. It is therefore good for showing the difference between, say, 1 and 3 edges, but the difference between 10 and 30 is much less obvious. As a compromise, we can make the number of edges approximately proportional to the number of passengers. One way to do this is to create the graph from an adjacency matrix:
cities <- unique(c(df2$City1, df2$City2))
m <- matrix(0, nrow = length(cities), ncol = length(cities),
            dimnames = list(cities, cities))
# fill the matrix with passenger counts, then scale so the smallest
# positive entry corresponds to one edge
for (i in seq(nrow(df2))) m[df2[i, 1], df2[i, 2]] <- df2[i, 4]
m <- m/min(m[m > 0])
edgebundle(graph_from_adjacency_matrix(m))
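A loop-free way to build the same weighted matrix, as a sketch using df2 and cities from above: convert the city columns to factors with a common level set, so that xtabs produces a square matrix.
df3 <- df2
df3$City1 <- factor(df3$City1, levels = cities)
df3$City2 <- factor(df3$City2, levels = cities)
# xtabs sums ave.pas into a square City1 x City2 table
m2 <- as.matrix(xtabs(ave.pas ~ City1 + City2, data = df3))
m2 <- m2 / min(m2[m2 > 0])
edgebundle(graph_from_adjacency_matrix(m2))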

Is there an R function for including column into row data

I would like to perform a chi-square test in R by transforming a data frame read from a csv file, from the following structure
Observed Values East West North South
Males 50 142 131 70
Females 435 1523 1356 750
to
following example
Row Observed value Region
1 1 East
2 1 East
3 1 East
...
435 0 East
Given that 1 = male and 0 = female.
I have been trying to use the stack and data.frame functions to create the new table in R. I need the following table to perform the chi-square test. The code I am trying is below:
Stacked_data <- stack(data)
library(dummies)
df1 <- data.frame(id = 1:0, Observed.Values )
df2 <- cbind(Stacked_data, dummy(df1$id, sep = "_"))
The expected result will contain two columns (observed value and region). Observed value will contain the categorical value, male = 1 and female = 0; Region will contain the region for the respective observed value.
So that when I run
table(Region,Observed Values)
It will produce
Observed Values
Region 1 0
East 50 435
West 142 1523
North 131 1356
South 70 750
Update: based on your expected output, you don't need much at all. Using obs from below, all you need to get your output (on which you can run chisq.test) is:
obs2 <- t(obs[,-1])
dimnames(obs2) <- list(Region = rownames(obs2), Observed = c('1', '0'))
obs2
# Observed
# Region 1 0
# East 50 435
# West 142 1523
# North 131 1356
# South 70 750
But, then again, if all you need is to run a chisq.test on them, it doesn't matter which orientation you use:
### original frame you provided
chisq.test(obs[,-1])
# Pearson's Chi-squared test
# data: as.matrix(obs[, -1])
# X-squared = 1.5959, df = 3, p-value = 0.6603
### transposed/re-labeled frame
chisq.test(obs2)
# Pearson's Chi-squared test
# data: obs2
# X-squared = 1.5959, df = 3, p-value = 0.6603
No difference. Perhaps all you needed was the [,-1] part?
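Equivalently, you can keep the row labels by first converting the count columns to a named matrix; a small sketch, assuming obs as defined in the Data block below:
m <- as.matrix(obs[, -1])    # drop the label column, keep the counts
rownames(m) <- obs$Observed  # label the rows Males/Females
chisq.test(m)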
Here's an attempt, though I don't know that it's exactly what you expect. (Input data is at the bottom of this answer.)
library(dplyr)
library(tidyr)
out1 <- obs %>%
  gather(Region, v, -Observed) %>%
  rowwise() %>%
  do(tibble(Region = .$Region, Observed = rep(1L * (.$Observed == "Males"), .$v))) %>%
  ungroup() %>%
  mutate(Row = row_number())
out1
# # A tibble: 4,457 x 3
# Region Observed Row
# <chr> <int> <int>
# 1 East 1 1
# 2 East 1 2
# 3 East 1 3
# 4 East 1 4
# 5 East 1 5
# 6 East 1 6
# 7 East 1 7
# 8 East 1 8
# 9 East 1 9
# 10 East 1 10
# # ... with 4,447 more rows
We can verify that it is reversible with
xtabs(~ Observed + Region, data = out1)
# Region
# Observed East North South West
# 0 435 1356 750 1523
# 1 50 131 70 142
(even though the columns and rows are in a different order than in the input, the numbers match).
Data:
obs <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Observed East West North South
Males 50 142 131 70
Females 435 1523 1356 750 ")
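For comparison, here is a base-R sketch of the same expansion using this obs: replicate the 1/0 labels and the region names by the corresponding counts.
counts <- as.matrix(obs[, -1])
rownames(counts) <- c("1", "0")  # 1 = male, 0 = female (row order is Males, Females)
long <- data.frame(
  Observed = rep(rep(rownames(counts), times = ncol(counts)), as.vector(counts)),
  Region   = rep(rep(colnames(counts), each = nrow(counts)), as.vector(counts))
)
table(long$Region, long$Observed)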

Subset a dataframe by unique combination of values from another dataframe in R

I have a large dataframe A similar to the following and a second one, B, containing only lat/lon values.
What I am trying to do is to subset dataframe A based on the unique combinations of lat/lon from dataframe B.
So far, I have tried the following, but it does not work.
How should I change my code in order to effectively do this?
head(A)
vals time lon lat mo year
1 5 1978-11-01 100 32 01 1988
2 3 1978-11-02 100 45 02 1988
3 3 1978-11-03 100 45 01 1998
4 9 1978-11-04 100 50 05 1998
5 1 1978-11-05 100 60 05 1998
6 4 1978-11-06 100 32 05 1998
A_subset <- subset(A, A[, "lon"] %in% B$lon | A[, "lat"] %in% B$lat)
Consider running expand.grid on data frame B for all combinations of the unique coordinates, then merging with data frame A:
B_all_combns <- expand.grid(lon = unique(B$lon), lat = unique(B$lat))
A_subset <- merge(A, B_all_combns, by=c("lon", "lat"))
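Note that expand.grid crosses every unique lon with every unique lat. If instead you want only the lat/lon pairs that actually occur together in B, a sketch (assuming B has lon and lat columns) is to merge on the unique rows of B directly:
A_subset2 <- merge(A, unique(B[, c("lon", "lat")]), by = c("lon", "lat"))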

If/Else statement in R

I have two dataframes in R:
city price bedroom
San Jose 2000 1
Barstow 1000 1
NA 1500 1
Code to recreate:
data = data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000, 1000, 1500), bedroom = c(1, 1, 1))
and:
Name Density
San Jose 5358
Barstow 547
Code to recreate:
population_density = data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
I want to create an additional column named city_type in the data dataset based on a condition: if the city's population density is above 1000 it is Urban, below 1000 it is Suburb, and NA stays NA.
city price bedroom city_type
San Jose 2000 1 Urban
Barstow 1000 1 Suburb
NA 1500 1 NA
I am using a for loop for conditional flow:
for (row in 1:length(data)) {
  if (is.na(data[row, 'city'])) {
    data[row, 'city_type'] = NA
  } else if (population[population$Name == data[row, 'city'], ]$Density >= 1000) {
    data[row, 'city_type'] = 'Urban'
  } else {
    data[row, 'city_type'] = 'Suburb'
  }
}
The for loop runs with no error in my original dataset with over 20000 observations; however, it yields a lot of wrong results (it yields NA for the most part).
What has gone wrong here and how can I do better to achieve my desired result?
First, what went wrong: 1:length(data) loops over the number of columns (3), not the number of rows, so in your 20000-row dataset only the first three rows ever get a city_type and the rest stay NA (your snippet also refers to population where the frame is called population_density). That said, I have become quite a fan of dplyr pipelines for this type of join/filter/mutate workflow, so here is my suggestion:
library(dplyr)
# recreate the data, with the NA city included
data <- data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000,1000, 500), bedroom = c(1,1,1))
population <- data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
data %>%
  # join the two dataframes by matching up the city name columns
  left_join(population, by = c("city" = "Name")) %>%
  # add your new column based on the desired condition
  mutate(
    city_type = ifelse(Density >= 1000, "Urban", "Suburb")
  )
Output:
city price bedroom Density city_type
1 San Jose 2000 1 5358 Urban
2 Barstow 1000 1 547 Suburb
3 <NA> 500 1 NA <NA>
Using ifelse, create city_type in population_density, then use match to carry it over to data:
population_density$city_type=ifelse(population_density$Density>1000,'Urban','Suburb')
data$city_type=population_density$city_type[match(data$city,population_density$Name)]
data
city price bedroom city_type
1 San Jose 2000 1 Urban
2 Barstow 1000 1 Suburb
3 <NA> 1500 1 <NA>
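A base-R equivalent of the left join above, as a sketch (note that merge may reorder rows; all.x = TRUE keeps the NA city, whose missing Density propagates to an NA city_type):
data2 <- merge(data, population_density, by.x = "city", by.y = "Name", all.x = TRUE)
data2$city_type <- ifelse(data2$Density >= 1000, "Urban", "Suburb")
data2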

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 World Cup.
football <- read.csv(
  file = "http://pastebin.com/raw.php?i=iTXdPvGf",
  header = TRUE,
  strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) ==
                 abs(football$home_score - football$away_score)), ]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany Europe 4 Portugal Europe 0 w
So those are the games with the highest goal differential, but now I need to make a new data frame with the team name and abs(football$home_score - football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
                          ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. First get the score differences and winners: when result is "w", home is the winner, so you do not have to compare scores at all. Once you have added the score difference and winner columns, subset your data to the rows where the difference equals max().
mydf <- read.csv(file = "http://pastebin.com/raw.php?i=iTXdPvGf",
                 header = TRUE, strip.white = TRUE)
mydf <- head(mydf, n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
       winner = ifelse(result == "w", as.character(home),
                       ifelse(result == "l", as.character(away), "draw"))) %>%
  filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option that avoids ifelse for creating the "winner" column. It is based on row/column indexing: the numeric column index comes from matching the result column against its unique values (match(football$result, c('w', 'l', 'd'))), and the row index is just 1:nrow(football). Subset the "football" dataset to the 'home' and 'away' columns and cbind an additional column 'draw' filled with NA, so that the 'd' elements in "result" map to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')], draw = NA)[
  cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
