wanted to extract table info from the webpage - web-scraping

I just want to scrape the "Countries in the world by population" table from the page below:
"https://www.worldometers.info/world-population/population-by-country/"

You can grab the desired table data using pandas:
import pandas as pd
import requests

headers = {"User-Agent": "Mozilla/5.0"}
url = 'https://www.worldometers.info/world-population/population-by-country/'
red = requests.get(url, headers=headers).text
df = pd.read_html(red)[0]
print(df)
Output:
0      1             China  1439323776  ...    38  61 %  18.47 %
1      2             India  1380004385  ...    28  35 %  17.70 %
2      3     United States   331002651  ...    38  83 %   4.25 %
3      4         Indonesia   273523615  ...    30  56 %   3.51 %
4      5          Pakistan   220892340  ...    23  35 %   2.83 %
..   ...               ...         ...  ...   ...   ...      ...
230  231        Montserrat        4992  ...  N.A.  10 %   0.00 %
231  232  Falkland Islands        3480  ...  N.A.  66 %   0.00 %
232  233              Niue        1626  ...  N.A.  46 %   0.00 %
233  234           Tokelau        1357  ...  N.A.   0 %   0.00 %
234  235          Holy See         801  ...  N.A.  N.A.   0.00 %
[235 rows x 12 columns]
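One small caveat worth knowing: recent pandas releases (2.1+) deprecate passing raw HTML text straight to read_html; wrapping the string in StringIO is the supported form. A minimal offline sketch (an inline table stands in for the downloaded page):

```python
from io import StringIO

import pandas as pd

# Newer pandas warns when read_html() receives raw HTML as a plain string;
# wrap the text in StringIO instead. The tiny table below stands in for
# the response text fetched with requests in the answer above.
html = """<table>
<tr><th>Country</th><th>Population</th></tr>
<tr><td>China</td><td>1439323776</td></tr>
<tr><td>India</td><td>1380004385</td></tr>
</table>"""
df = pd.read_html(StringIO(html))[0]
print(df.shape)  # (2, 2)
```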


How to make data frame from two vectors in R?

I have two vectors here. One is all the data about population for various countries:
## [1] "China 1439323776 0.39 5540090 153 9388211 -348399 1.7 38 61 18.47"
## [2] "India 1380004385 0.99 13586631 464 2973190 -532687 2.2 28 35 17.70"
## [3] "United States 331002651 0.59 1937734 36 9147420 954806 1.8 38 83 4.25"
## [4] "Indonesia 273523615 1.07 2898047 151 1811570 -98955 2.3 30 56 3.51"
## [5] "Pakistan 220892340 2.00 4327022 287 770880 -233379 3.6 23 35 2.83"
## [6] "Brazil 212559417 0.72 1509890 25 8358140 21200 1.7 33 88 2.73"
## [7] "Nigeria 206139589 2.58 5175990 226 910770 -60000 5.4 18 52 2.64"
## [8] "Bangladesh 164689383 1.01 1643222 1265 130170 -369501 2.1 28 39 2.11"
## [9] "Russia 145934462 0.04 62206 9 16376870 182456 1.8 40 74 1.87 "
## [10] "Tokelau 1357 1.27 17 136 10 N.A. N.A. 0 0.00"
## [11] "Holy See 801 0.25 2 2003 0 N.A. N.A. N.A. 0.00"
The other vector is all the column names in the exact order corresponding to the country name and those numbers above:
## [1] "Country(ordependency)" "Population(2020)" "YearlyChange"
## [4] "NetChange" "Density(P/Km²)" "LandArea(Km²)"
## [7] "Migrants(net)" "Fert.Rate" "Med.Age"
## [10] "UrbanPop%" "WorldShare"
How do I make a data frame that matches each column name to its corresponding data, like this:
head(population)
Country (or dependency) Population (2020) Yearly Change Net Change Density (P/Km²) ......
1 China 1439323776 0.39 5540090 ... ....
2 India 1380004385 0.99 13586631 .......
3 United States 331002651 0.59 1937734 .......
4 Indonesia 273523615 1.07 2898047 .......
5 Pakistan 220892340 2.00 4327022 .......
Note: For the last two countries Tokelau and Holy See there are no "Migrants(net)" data.
TIA!
EDIT:
Some more samples are here:
## [53] "Côte d'Ivoire 26378274 2.57 661730 83 318000 -8000 4.7 19 51 0.34"
## [86] "Czech Republic (Czechia) 10708981 0.18 19772 139 77240 22011 1.6 43 74 0.14"
## [93] "United Arab Emirates 9890402 1.23 119873 118 83600 40000 1.4 33 86 0.13"
## [98] "Papua New Guinea 8947024 1.95 170915 20 452860 -800 3.6 22 13 0.11"
## [135] "Bosnia and Herzegovina 3280819 -0.61 -20181 64 51000 -21585 1.3 43 52 0.04"
## [230] "Saint Pierre & Miquelon 5794 -0.48 -28 25 230 N.A. N.A. 100 0.00"
UPDATES:
Here is the problem:
tail(population)
## Country(ordependency) Population(2020) YearlyChange NetChange
## 230 Saint Pierre & Miquelon 5794 -0.48 -28
## 231 Montserrat 4992 0.06 3
## 232 Falkland Islands 3480 3.05 103
## 233 Niue 1626 0.68 11
## 234 Tokelau 1357 1.27 17
## 235 Holy See 801 0.25 2
## Density(P/Km²) LandArea(Km²) Migrants(net) Fert.Rate Med.Age UrbanPop%
## 230 25 230 N.A. N.A. 100 0.00
## 231 50 100 N.A. N.A. 10 0.00
## 232 0 12170 N.A. N.A. 66 0.00
## 233 6 260 N.A. N.A. 46 0.00
## 234 136 10 N.A. N.A. 0 0.00
## 235 2003 0 N.A. N.A. N.A. 0.00
## WorldShare
## 230 NA
## 231 NA
## 232 NA
## 233 NA
## 234 NA
## 235 NA
All the rows with 10 variables instead of 11 are here:
## [202] "Isle of Man 85033 0.53 449 149 570 N.A. N.A. 53 0.00"
## [203] "Andorra 77265 0.16 123 164 470 N.A. N.A. 88 0.00"
## [204] "Dominica 71986 0.25 178 96 750 N.A. N.A. 74 0.00"
## [205] "Cayman Islands 65722 1.19 774 274 240 N.A. N.A. 97 0.00"
## [206] "Bermuda 62278 -0.36 -228 1246 50 N.A. N.A. 97 0.00"
## [207] "Marshall Islands 59190 0.68 399 329 180 N.A. N.A. 70 0.00"
## [208] "Northern Mariana Islands 57559 0.60 343 125 460 N.A. N.A. 88 0.00"
## [209] "Greenland 56770 0.17 98 0 410450 N.A. N.A. 87 0.00"
## [210] "American Samoa 55191 -0.22 -121 276 200 N.A. N.A. 88 0.00"
## [211] "Saint Kitts & Nevis 53199 0.71 376 205 260 N.A. N.A. 33 0.00"
## [212] "Faeroe Islands 48863 0.38 185 35 1396 N.A. N.A. 43 0.00"
## [213] "Sint Maarten 42876 1.15 488 1261 34 N.A. N.A. 96 0.00"
## [214] "Monaco 39242 0.71 278 26337 1 N.A. N.A. N.A. 0.00"
## [215] "Turks and Caicos 38717 1.38 526 41 950 N.A. N.A. 89 0.00"
## [216] "Saint Martin 38666 1.75 664 730 53 N.A. N.A. 0 0.00"
## [217] "Liechtenstein 38128 0.29 109 238 160 N.A. N.A. 15 0.00"
## [218] "San Marino 33931 0.21 71 566 60 N.A. N.A. 97 0.00"
## [219] "Gibraltar 33691 -0.03 -10 3369 10 N.A. N.A. N.A. 0.00"
## [220] "British Virgin Islands 30231 0.67 201 202 150 N.A. N.A. 52 0.00"
## [221] "Caribbean Netherlands 26223 0.94 244 80 328 N.A. N.A. 75 0.00"
## [222] "Palau 18094 0.48 86 39 460 N.A. N.A. N.A. 0.00"
## [223] "Cook Islands 17564 0.09 16 73 240 N.A. N.A. 75 0.00"
## [224] "Anguilla 15003 0.90 134 167 90 N.A. N.A. N.A. 0.00"
## [225] "Tuvalu 11792 1.25 146 393 30 N.A. N.A. 62 0.00"
## [226] "Wallis & Futuna 11239 -1.69 -193 80 140 N.A. N.A. 0 0.00"
## [227] "Nauru 10824 0.63 68 541 20 N.A. N.A. N.A. 0.00"
## [228] "Saint Barthelemy 9877 0.30 30 470 21 N.A. N.A. 0 0.00"
## [229] "Saint Helena 6077 0.30 18 16 390 N.A. N.A. 27 0.00"
## [230] "Saint Pierre & Miquelon 5794 -0.48 -28 25 230 N.A. N.A. 100 0.00"
## [231] "Montserrat 4992 0.06 3 50 100 N.A. N.A. 10 0.00"
## [232] "Falkland Islands 3480 3.05 103 0 12170 N.A. N.A. 66 0.00"
## [233] "Niue 1626 0.68 11 6 260 N.A. N.A. 46 0.00"
## [234] "Tokelau 1357 1.27 17 136 10 N.A. N.A. 0 0.00"
## [235] "Holy See 801 0.25 2 2003 0 N.A. N.A. N.A. 0.00"
It would be easier to read this with read.table using space as the delimiter. But there is an issue: the 'Country' may contain multiple words, which should be read as a single column. To handle that, we can insert quotes around the Country as a field boundary using sub, and then read with read.table while specifying the col.names as 'v2':
df1 <- read.table(text = sub("^([^0-9]+)\\s", ' "\\1"', v1),
header = FALSE, col.names = v2, fill = TRUE, check.names = FALSE)
Output:
df1
Country(ordependency) Population(2020) YearlyChange NetChange Density(P/Km²) LandArea(Km²) Migrants(net) Fert.Rate Med.Age UrbanPop%
1 China 1439323776 0.39 5540090 153 9388211 -348399 1.7 38 61
2 India 1380004385 0.99 13586631 464 2973190 -532687 2.2 28 35
3 United States 331002651 0.59 1937734 36 9147420 954806 1.8 38 83
4 Indonesia 273523615 1.07 2898047 151 1811570 -98955 2.3 30 56
5 Pakistan 220892340 2.00 4327022 287 770880 -233379 3.6 23 35
6 Brazil 212559417 0.72 1509890 25 8358140 21200 1.7 33 88
7 Nigeria 206139589 2.58 5175990 226 910770 -60000 5.4 18 52
8 Bangladesh 164689383 1.01 1643222 1265 130170 -369501 2.1 28 39
9 Russia 145934462 0.04 62206 9 16376870 182456 1.8 40 74
10 Tokelau 1357 1.27 17 136 10 N.A. N.A. 0 0
11 Holy See 801 0.25 2 2003 0 N.A. N.A. N.A. 0
12 Côte d'Ivoire 26378274 2.57 661730 83 318000 -8000 4.7 19 51
13 Czech Republic (Czechia) 10708981 0.18 19772 139 77240 22011 1.6 43 74
14 United Arab Emirates 9890402 1.23 119873 118 83600 40000 1.4 33 86
15 Papua New Guinea 8947024 1.95 170915 20 452860 -800 3.6 22 13
16 Bosnia and Herzegovina 3280819 -0.61 -20181 64 51000 -21585 1.3 43 52
17 Saint Pierre & Miquelon 5794 -0.48 -28 25 230 N.A. N.A. 100 0
WorldShare
1 18.47
2 17.70
3 4.25
4 3.51
5 2.83
6 2.73
7 2.64
8 2.11
9 1.87
10 NA
11 NA
12 0.34
13 0.14
14 0.13
15 0.11
16 0.04
17 NA
For the rows where the field count is lower, we can update the column values by shifting them with row/column indexing:
library(stringr)
cnt <- str_count(sub("^([^0-9]+)\\s", '', v1), "\\s+") + 2
i1 <- cnt == 10
df1[i1, 10:11] <- df1[i1, 9:10]
df1[i1, 9] <- NA
data
v1 <- c("China 1439323776 0.39 5540090 153 9388211 -348399 1.7 38 61 18.47",
"India 1380004385 0.99 13586631 464 2973190 -532687 2.2 28 35 17.70",
"United States 331002651 0.59 1937734 36 9147420 954806 1.8 38 83 4.25",
"Indonesia 273523615 1.07 2898047 151 1811570 -98955 2.3 30 56 3.51",
"Pakistan 220892340 2.00 4327022 287 770880 -233379 3.6 23 35 2.83",
"Brazil 212559417 0.72 1509890 25 8358140 21200 1.7 33 88 2.73",
"Nigeria 206139589 2.58 5175990 226 910770 -60000 5.4 18 52 2.64",
"Bangladesh 164689383 1.01 1643222 1265 130170 -369501 2.1 28 39 2.11",
"Russia 145934462 0.04 62206 9 16376870 182456 1.8 40 74 1.87 ",
"Tokelau 1357 1.27 17 136 10 N.A. N.A. 0 0.00", "Holy See 801 0.25 2 2003 0 N.A. N.A. N.A. 0.00",
"Côte d'Ivoire 26378274 2.57 661730 83 318000 -8000 4.7 19 51 0.34",
"Czech Republic (Czechia) 10708981 0.18 19772 139 77240 22011 1.6 43 74 0.14",
"United Arab Emirates 9890402 1.23 119873 118 83600 40000 1.4 33 86 0.13",
"Papua New Guinea 8947024 1.95 170915 20 452860 -800 3.6 22 13 0.11",
"Bosnia and Herzegovina 3280819 -0.61 -20181 64 51000 -21585 1.3 43 52 0.04",
"Saint Pierre & Miquelon 5794 -0.48 -28 25 230 N.A. N.A. 100 0.00"
)
v2 <- c("Country(ordependency)", "Population(2020)", "YearlyChange",
"NetChange", "Density(P/Km²)", "LandArea(Km²)", "Migrants(net)",
"Fert.Rate", "Med.Age", "UrbanPop%", "WorldShare")
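Collected into one runnable chunk, on a two-row subset of the data above (a slight variant of the substitution: the quoted country keeps an explicit trailing space, and base R's gregexpr stands in for stringr::str_count):

```r
v1 <- c("China 1439323776 0.39 5540090 153 9388211 -348399 1.7 38 61 18.47",
        "Tokelau 1357 1.27 17 136 10 N.A. N.A. 0 0.00")
v2 <- c("Country(ordependency)", "Population(2020)", "YearlyChange",
        "NetChange", "Density(P/Km²)", "LandArea(Km²)", "Migrants(net)",
        "Fert.Rate", "Med.Age", "UrbanPop%", "WorldShare")

# quote the leading non-numeric country name so read.table sees one field
df1 <- read.table(text = sub("^([^0-9]+)\\s", '"\\1" ', v1),
                  header = FALSE, col.names = v2, fill = TRUE,
                  check.names = FALSE)

# count the fields after the country; rows with 10 fields are short one column
cnt <- lengths(gregexpr("\\s+", sub("^([^0-9]+)\\s", "", v1))) + 2
i1 <- cnt == 10
df1[i1, 10:11] <- df1[i1, 9:10]   # shift the last two values one column right
df1[i1, 9] <- NA                  # the missing column becomes NA
```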
I am not sure what you mean, but you could try:
df <- do.call(rbind.data.frame, vector1)
colnames(df) <- vector2

duplicate low frequency values in a factor

I need to duplicate those levels whose frequency in my factor variable called groups is less than 500.
> head(groups)
[1] 0000000 1000000 1000000 1000000 0000000 0000000
75 Levels: 0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 ... 1111110
For example:
> table(group)
group
0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 0011000 0011010 0011100
58674 6 1033 654 223 1232 31 222 17 818 132 32 15 42 9 9
0011110 0100000 0100001 0100010 0100100 0100101 0100110 0101000 0101010 0101100 0101110 0110000 0110010 0110100 0110110 0111000
1 10609 1 487 64 1 58 132 11 12 3 142 27 9 7 11
0111010 0111100 0111110 1000000 1000001 1000010 1000011 1000100 1000101 1000110 1001000 1001001 1001010 1001100 1001110 1010000
5 1 2 54245 10 1005 1 329 1 138 573 1 31 71 11 969
1010010 1010100 1010110 1011000 1011010 1011100 1011110 1100000 1100001 1100010 1100011 1100100 1100110 1101000 1101010 1101011
147 29 21 63 15 10 4 14161 6 770 1 142 96 260 23 1
1101100 1101110 1110000 1110001 1110010 1110100 1110110 1111000 1111010 1111100 1111110
34 16 439 2 103 13 26 36 13 8 5
Groups 0000001, 0000110, 0001010, 0001100... must be duplicated up to 500.
Ideally I would end up with a balanced sample of groups: duplicate the levels with frequency below 500, and downsample the rest (levels with frequency above 500), until every level reaches 500.
We can use rep on the levels of 'group' for the desired 'n'
factor(rep(levels(group), each = n))
If we need to use the table results as well
factor(rep(levels(group), table(group) + n - table(group)))
Or with pmax
factor(rep(levels(group), pmax(n, table(group))))
data
set.seed(24)
group <- factor(sample(letters[1:6], 3000, replace = TRUE))
n <- 500
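A quick check with the sample data above shows the pmax() variant padding every level to at least n (note this only duplicates; it does not downsample levels that are already above n):

```r
set.seed(24)
group <- factor(sample(letters[1:6], 3000, replace = TRUE))
n <- 500

# repeat each level max(n, observed count) times;
# table(group) is in level order, so it lines up with levels(group)
balanced <- factor(rep(levels(group), pmax(n, table(group))))
min(table(balanced))  # no level falls below 500
```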

How to dynamically select columns

Let's assume I ran a random forest model and I get the variable importance info as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now, if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this info to select only these variables from the original dataset (given below), without spelling out the actual variable names, instead using, say, the output of top.var? How do I use the dplyr select function for this?
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp <- moviesdf.cluster %>% select(one_of(top.var$Vars), cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
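A version note: select_() has since been deprecated in dplyr (from 0.7 onward the tidy-evaluation helpers replace it); in current dplyr the same selection is written with all_of(), and plain base-R indexing also works. A small sketch with stand-in columns (the movie data isn't reproduced here):

```r
df <- data.frame(num_voted_users = c(886204, 471220),
                 gross = c(760505847, 309404152),
                 cluster = c(2, 3))
top.vars <- c("num_voted_users")   # e.g. top.var$Vars, a character vector

# dplyr >= 0.7:  df %>% select(all_of(top.vars), cluster)
# base R equivalent:
movies.imp <- df[, c(top.vars, "cluster")]
names(movies.imp)  # "num_voted_users" "cluster"
```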

Merging uneven Panel Data frames in R

I have two sets of panel data that I would like to merge. The problem is that, for each respective time interval, the variable which links the two data sets appears more frequently in the first data frame than the second. My objective is to add each row from the second data set to its corresponding row in the first data set, even if that necessitates copying said row multiple times in the same time interval. Specifically, I am working with basketball data from the NBA. The first data set is a panel of Player and Date while the second is one of Team (Tm) and Date. Thus, each Team entry should be copied multiple times per date, once for each player on that team who played that day. I could do this easily in excel, but the data frames are too large.
The result is 0 observations of 52 variables. I've experimented with bind, match, different versions of merge, and I've searched for everything I can think of; but, nothing seems to address this issue specifically. Disclaimer, I am very new to R.
Here is my code up until my road block:
HGwd = "~/Documents/Fantasy/Basketball"
library(plm)
library(mice)
library(VIM)
library(nnet)
library(tseries)
library(foreign)
library(ggplot2)
library(truncreg)
library(boot)
Pdata = read.csv("2015-16PlayerData.csv", header = T)
attach(Pdata)
Pdata$Age = as.numeric(as.character(Pdata$Age))
Pdata$Date = as.Date(Pdata$Date, '%m/%e/%Y')
names(Pdata)[8] = "OppTm"
Pdata$GS = as.factor(as.character(Pdata$GS))
Pdata$MP = as.numeric(as.character(Pdata$MP))
Pdata$FG = as.numeric(as.character(Pdata$FG))
Pdata$FGA = as.numeric(as.character(Pdata$FGA))
Pdata$X2P = as.numeric(as.character(Pdata$X2P))
Pdata$X2PA = as.numeric(as.character(Pdata$X2PA))
Pdata$X3P = as.numeric(as.character(Pdata$X3P))
Pdata$X3PA = as.numeric(as.character(Pdata$X3PA))
Pdata$FT = as.numeric(as.character(Pdata$FT))
Pdata$FTA = as.numeric(as.character(Pdata$FTA))
Pdata$ORB = as.numeric(as.character(Pdata$ORB))
Pdata$DRB = as.numeric(as.character(Pdata$DRB))
Pdata$TRB = as.numeric(as.character(Pdata$TRB))
Pdata$AST = as.numeric(as.character(Pdata$AST))
Pdata$STL = as.numeric(as.character(Pdata$STL))
Pdata$BLK = as.numeric(as.character(Pdata$BLK))
Pdata$TOV = as.numeric(as.character(Pdata$TOV))
Pdata$PF = as.numeric(as.character(Pdata$PF))
Pdata$PTS = as.numeric(as.character(Pdata$PTS))
PdataPD = plm.data(Pdata, index = c("Player", "Date"))
attach(PdataPD)
Tdata = read.csv("2015-16TeamData.csv", header = T)
attach(Tdata)
Tdata$Date = as.Date(Tdata$Date, '%m/%e/%Y')
names(Tdata)[3] = "OppTm"
Tdata$MP = as.numeric(as.character(Tdata$MP))
Tdata$FG = as.numeric(as.character(Tdata$FG))
Tdata$FGA = as.numeric(as.character(Tdata$FGA))
Tdata$X2P = as.numeric(as.character(Tdata$X2P))
Tdata$X2PA = as.numeric(as.character(Tdata$X2PA))
Tdata$X3P = as.numeric(as.character(Tdata$X3P))
Tdata$X3PA = as.numeric(as.character(Tdata$X3PA))
Tdata$FT = as.numeric(as.character(Tdata$FT))
Tdata$FTA = as.numeric(as.character(Tdata$FTA))
Tdata$PTS = as.numeric(as.character(Tdata$PTS))
Tdata$Opp.FG = as.numeric(as.character(Tdata$Opp.FG))
Tdata$Opp.FGA = as.numeric(as.character(Tdata$Opp.FGA))
Tdata$Opp.2P = as.numeric(as.character(Tdata$Opp.2P))
Tdata$Opp.2PA = as.numeric(as.character(Tdata$Opp.2PA))
Tdata$Opp.3P = as.numeric(as.character(Tdata$Opp.3P))
Tdata$Opp.3PA = as.numeric(as.character(Tdata$Opp.3PA))
Tdata$Opp.FT = as.numeric(as.character(Tdata$Opp.FT))
Tdata$Opp.FTA = as.numeric(as.character(Tdata$Opp.FTA))
Tdata$Opp.PTS = as.numeric(as.character(Tdata$Opp.PTS))
TdataPD = plm.data(Tdata, index = c("OppTm", "Date"))
attach(TdataPD)
PD = merge(PdataPD, TdataPD, by = "OppTm", all.x = TRUE)
attach(PD)
Any help on how to do this would be greatly appreciated!
EDIT
I've tweaked it a little from last night, but still nothing seems to do the trick. See the above, updated code for what I am currently using.
Here is the output for head(PdataPD):
Player Date Rk Pos Tm X..H OppTm W.L GS MP FG FGA FG. X2P
22408 Aaron Brooks 2015-10-27 817 G CHI CLE W 0 16 3 9 0.333 3
22144 Aaron Brooks 2015-10-28 553 G CHI # BRK W 0 16 5 9 0.556 3
21987 Aaron Brooks 2015-10-30 396 G CHI # DET L 0 18 2 6 0.333 1
21456 Aaron Brooks 2015-11-01 4687 G CHI ORL W 0 16 3 11 0.273 3
21152 Aaron Brooks 2015-11-03 4383 G CHI # CHO L 0 17 5 8 0.625 1
20805 Aaron Brooks 2015-11-05 4036 G CHI OKC W 0 13 4 8 0.500 3
X2PA X2P. X3P X3PA X3P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS GmSc
22408 8 0.375 0 1 0.000 0 0 NA 0 2 2 0 0 0 2 1 6 -0.9
22144 3 1.000 2 6 0.333 0 0 NA 0 1 1 3 1 0 1 4 12 8.5
21987 2 0.500 1 4 0.250 0 0 NA 0 4 4 4 0 0 0 1 5 5.2
21456 6 0.500 0 5 0.000 0 0 NA 2 1 3 1 1 1 1 4 6 1.0
21152 3 0.333 4 5 0.800 0 0 NA 0 0 0 4 1 0 0 4 14 12.6
20805 5 0.600 1 3 0.333 0 0 NA 1 1 2 0 0 0 0 1 9 5.6
FPTS H.A
22408 7.50 H
22144 20.25 A
21987 16.50 A
21456 14.75 H
21152 24.00 A
20805 12.00 H
And for head(TdataPD):
OppTm Date Rk X Opp Result MP FG FGA FG. X2P X2PA X2P. X3P X3PA
2105 ATL 2015-10-27 71 DET L 94-106 240 37 82 0.451 29 55 0.527 8 27
2075 ATL 2015-10-29 41 # NYK W 112-101 240 42 83 0.506 32 59 0.542 10 24
2047 ATL 2015-10-30 13 CHO W 97-94 240 36 83 0.434 28 60 0.467 8 23
2025 ATL 2015-11-01 437 # CHO W 94-92 240 37 88 0.420 30 59 0.508 7 29
2001 ATL 2015-11-03 413 # MIA W 98-92 240 37 90 0.411 30 69 0.435 7 21
1973 ATL 2015-11-04 385 BRK W 101-87 240 37 76 0.487 29 54 0.537 8 22
X3P. FT FTA FT. PTS Opp.FG Opp.FGA Opp.FG. Opp.2P Opp.2PA Opp.2P. Opp.3P
2105 0.296 12 15 0.800 94 37 96 0.385 25 67 0.373 12
2075 0.417 18 26 0.692 112 38 93 0.409 32 64 0.500 6
2047 0.348 17 22 0.773 97 36 88 0.409 24 58 0.414 12
2025 0.241 13 14 0.929 94 32 86 0.372 18 49 0.367 14
2001 0.333 17 22 0.773 98 38 86 0.442 33 58 0.569 5
1973 0.364 19 24 0.792 101 36 83 0.434 31 62 0.500 5
Opp.3PA Opp.3P. Opp.FT Opp.FTA Opp.FT. Opp.PTS
2105 29 0.414 20 26 0.769 106
2075 29 0.207 19 21 0.905 101
2047 30 0.400 10 13 0.769 94
2025 37 0.378 14 15 0.933 92
2001 28 0.179 11 16 0.688 92
1973 21 0.238 10 13 0.769 87
If there is a way to truncate the output from dput(head(___)), I am not familiar with it. It appears that simply erasing the excess characters would remove entire variables from the dataset.
It would help if you posted your data (or a working subset of it) and a little more detail on how you are trying to merge. But if I understand correctly, you want each final record to have the individual stats for each player on a particular date, followed by that player's team's stats for the same date. In that case, you should have a team column in the Player table that identifies the player's team, and then join the two tables on the composite key of Date and Team by setting the by= argument in merge:
merge(PData, TData, by = c("Date", "Team"))
The fact that the data frames are of different lengths doesn't matter--this is exactly what join/merge operations are for.
For an alternative to merge(), you might check out the dplyr package join functions at https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
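To make the suggestion concrete, here is a toy sketch (invented column names and values, not the real 2015-16 data) showing that merging on the date plus team key copies each team row once per player on that team, which is the behaviour asked for:

```r
Pdata <- data.frame(Player = c("Player 1", "Player 2", "Player 3"),
                    Tm     = c("CHI", "CHI", "ATL"),
                    Date   = as.Date("2015-10-27"),
                    PTS    = c(6, 12, 20))
Tdata <- data.frame(Tm      = c("CHI", "ATL"),
                    Date    = as.Date("2015-10-27"),
                    TeamPTS = c(97, 94))

# composite key: each player row picks up its own team's stats for that date
PD <- merge(Pdata, Tdata, by = c("Date", "Tm"), all.x = TRUE)
nrow(PD)  # 3: the CHI team row is duplicated for both CHI players
```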

Equivalent of index - match in Excel to return greater than the lookup value

In R I need to perform a function similar to INDEX-MATCH in Excel, one that returns the value just greater than the lookup value.
Data Set A
Country GNI2009
Ukraine 6604
Egypt 5937
Morocco 5307
Philippines 4707
Indonesia 4148
India 3677
Viet Nam 3180
Pakistan 2760
Nigeria 2699
Data Set B
GNI2004 s1 s2 s3 s4
6649 295 33 59 3
6021 260 30 50 3
5418 226 27 42 2
4846 193 23 35 2
4311 162 20 29 2
3813 134 16 23 1
3356 109 13 19 1
2976 89 10 15 1
2578 68 7 11 0
2248 51 5 8 0
2199 48 5 8 0
For each country's 2009 GNI level (data set A), I would like to find which GNI2004 is just greater than or equal to GNI2009, and then return the corresponding sales values (s1, s2, ...) from that row (data set B). I would like to repeat this for every Country-GNI row for 2009 in table A.
For example: Nigeria with a GNI2009 of 2698 in data set A would return:
GNI2004 s1 s2 s3 s4
2976 89 10 15 1
In Excel I guess this would be something like INDEX and MATCH, where the match condition would be match(lookup value, lookup array, -1).
You could try data.table's rolling join, which is designed to achieve just that:
library(data.table) # V1.9.6+
indx <- setDT(DataB)[setDT(DataA), roll = -Inf, on = c(GNI2004 = "GNI2009"), which = TRUE]
DataA[, names(DataB) := DataB[indx]]
DataA
# Country GNI2009 GNI2004 s1 s2 s3 s4
# 1: Ukraine 6604 6649 295 33 59 3
# 2: Egypt 5937 6021 260 30 50 3
# 3: Morocco 5307 5418 226 27 42 2
# 4: Philippines 4707 4846 193 23 35 2
# 5: Indonesia 4148 4311 162 20 29 2
# 6: India 3677 3813 134 16 23 1
# 7: Viet Nam 3180 3356 109 13 19 1
# 8: Pakistan 2760 2976 89 10 15 1
# 9: Nigeria 2699 2976 89 10 15 1
The idea here is, for each row in GNI2009, to find the closest equal-or-bigger value in GNI2004, get the row index, and subset. Then we update DataA with the result.
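If data.table isn't an option, the same "smallest value greater than or equal to the lookup" match can be sketched in base R with findInterval (shown on a subset of the tables above; DataB must be sorted ascending for findInterval to work):

```r
DataA <- data.frame(Country = c("Ukraine", "Pakistan", "Nigeria"),
                    GNI2009 = c(6604, 2760, 2699))
DataB <- data.frame(GNI2004 = c(6649, 6021, 2976, 2578),
                    s1      = c(295, 260, 89, 68))

B <- DataB[order(DataB$GNI2004), ]   # ascending order for findInterval
# with left.open = TRUE, x falls in the interval (B[i], B[i+1]],
# so B[i+1] is the smallest GNI2004 that is >= x
idx <- findInterval(DataA$GNI2009, B$GNI2004, left.open = TRUE) + 1
result <- cbind(DataA, B[idx, ])
```

One caveat: a GNI2009 larger than every GNI2004 would index past the end of B and produce an NA row, so the rolling-join answer above remains the more robust approach.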
