Add new column to long dataframe from another dataframe? - r

Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe that holds the age of the player in the first column, "names" (not "teammates_names")? And how do I then add another column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.

Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
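For instance, a minimal sketch of those assignments, using the data from the question (only the name columns of GOALS_DF are rebuilt here, for brevity):

```r
# Lookup table of ages, and the goals data reduced to its two name columns
AGE_DF <- data.frame(names = c("Sam", "Jon", "Adam", "Jason", "Jones", "Jermaine"),
                     age   = 20:25)
GOALS_DF <- data.frame(names           = rep(c("Sam", "Jon", "Adam"), each = 5),
                       teammates_names = rep(c("Jason", "Jones", "Jermaine"), each = 5))

# match() returns row positions in AGE_DF, so subsetting AGE_DF$age with
# them lines the ages up with GOALS_DF's rows -- once per name column
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names,           AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]
```

Both lookups are fully vectorized, so this scales to thousands of rows without a loop.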
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can use dcast to get back to the wide format and retain the row ordering if it's important.

Related

Two way table with mean of a third variable R

Here's my problem. I have a table, of which I show a sample here. I would like to have Country as rows, Stars as columns, and the mean price for each combination. I used aggregate, which gave me the information I want but not in the shape I want.
The table looks like this:
Country Stars Price
1 Canada 4 567
2 China 2 435
3 Russia 3 456
4 Canada 5 687
5 Canada 4 432
6 Russia 3 567
7 China 4 1200
8 Russia 3 985
9 Canada 2 453
10 Russia 3 234
11 Russia 4 546
12 Canada 3 786
13 China 2 456
14 China 3 234
15 Russia 4 800
16 China 5 987
I used this code :
aggregate(Stars[,3],list(Country=Stars$Country, Stars = Stars$Stars), mean)
output :
Country Stars x
1 Canada 2 453.0
2 China 2 445.5
3 Canada 3 786.0
4 China 3 234.0
5 Russia 3 560.5
6 Canada 4 499.5
7 China 4 1200.0
8 Russia 4 673.0
9 Canada 5 687.0
10 China 5 987.0
Here x stands for the mean; I would like to change x to "price mean" to...
So the goal would be to have one country per row and the number of stars as column with the mean of the price for each pair.
Thank you very much.
It seems you want an Excel-like pivot table. The package pivottabler helps a lot here. It also generates nice HTML tables (apart from printing results to the console).
library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")
2 3 4 5 Total
Canada 453 786 499.5 687 585
China 445.5 234 1200 987 662.4
Russia 560.5 673 598
Total 448 543.666666666667 709 837 614.0625
For number formatting, use the format argument:
qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
2 3 4 5 Total
Canada 453.00 786.00 499.50 687.00 585.00
China 445.50 234.00 1200.00 987.00 662.40
Russia 560.50 673.00 598.00
Total 448.00 543.67 709.00 837.00 614.06
For HTML output, use qhpvt instead.
qhpvt(df, "Country", "Stars", "mean(Price)")
(Rendered HTML table omitted.)
Note: tidyverse and base R methods are also possible and are easy too.
To obtain a two-way table of means in base R, after attaching the data you can use:
tapply(Price, list(Country, Stars), mean)
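The same tapply() route works without attach() (which is generally discouraged) by wrapping it in with(); a self-contained sketch on the question's data, assuming the data frame is called df:

```r
df <- data.frame(
  Country = c("Canada", "China", "Russia", "Canada", "Canada", "Russia", "China", "Russia",
              "Canada", "Russia", "Russia", "Canada", "China", "China", "Russia", "China"),
  Stars   = c(4, 2, 3, 5, 4, 3, 4, 3, 2, 3, 4, 3, 2, 3, 4, 5),
  Price   = c(567, 435, 456, 687, 432, 567, 1200, 985, 453, 234, 546, 786, 456, 234, 800, 987))

# Rows are countries, columns are star ratings, cells are mean prices;
# combinations with no observations come back as NA rather than blank
tab <- with(df, tapply(Price, list(Country, Stars), mean))
tab
```

Unlike the pivottabler output, there are no margin totals, but the country-by-stars cells match.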

How to manually enter a cell in a dataframe? [duplicate]

This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
dplyr replacing na values in a column based on multiple conditions
(2 answers)
Closed 2 years ago.
This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long and the row where the NA changes based on the day. I would like to add based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where the county is "New York C":
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to substitute the df$FIPS entries that meet the two desired conditions:
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
  mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
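A base R sketch of the same idea with replace(), shown on the seven-row example data from the first answer; replace() sidesteps ifelse()'s type-coercion quirks because the untouched entries are never recomputed:

```r
df <- data.frame(
  county = c("Abbeville", "Acadia", "Accomack", "New York C", "Ada", "Adair", "New York C"),
  state  = c("South Carolina", "Louisiana", "Virginia", "New York", "Idaho", "Iowa", "New York"),
  cases  = c(4, 9, 3, 2, 113, 1, 1),
  deaths = c(0, 1, 0, 0, 2, 0, 0),
  FIPS   = c(45001, 22001, 51001, NA, 16001, 19001, 18000))

# Only the row that is both NA and "New York C" is touched; the
# already-filled "New York C" row (FIPS 18000) is left alone
df$FIPS <- replace(df$FIPS, is.na(df$FIPS) & df$county == "New York C", 55555)
```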

To merge values from another dataframe based on partial string match while the matching column is not in same order

I want to merge one column from df2 with df1 by matching df1$District_name and df2$Districts.
But the character values in df1$District_name and df2$Districts are not in the same order and df1 and df2 are not of same length.
The values do not match exactly. df1 has more rows than df2, so the corresponding values for those extra district names should be zero.
df1 <- data.frame(
  State_name = c("Maharashtra", "Andhra Pradesh", "Bihar", "Bihar",
                 "West Bengal", "Gujarat", "Gujarat", "Assam"),
  District_name = c("Nashik", "Chittoor", "Madhepura", "Kishanganj",
                    "Howrah", "Gandhinagar", "Ahmadabad", "Sivasagar"),
  Value1 = c(5, 3, 6, 4, 4, 3, 2, 4))
df2 <- data.frame(
  Districts = c("Nashik", "Chitoor", "Kishanganj", "Madhepur", "Sibhasagar", "Ahmadabad"),
  FinanceIndex = c(0.20975, 0.12187, 0.37155, 0.66128, 0.10918, 0.54730))
# df1
State_name District_name Value1
1 Maharashtra Nashik 5
2 Andhra Pradesh Chittoor 3
3 Bihar Madhepura 6
4 Bihar Kishanganj 4
5 West Bengal Howrah 4
6 Gujarat Gandhinagar 3
7 Gujarat Ahmadabad 2
8 Assam Sivasagar 4
# df2
Districts FinanceIndex
1 Nashik 0.20975
2 Chitoor 0.12187
3 Kishanganj 0.37155
4 Madhepur 0.66128
5 Sibhasagar 0.10918
6 Ahmadabad 0.54730
I used the match function, but due to the spelling differences most of the values come back as zero:
index <- match(df1$District_name, df2$Districts)
df1$finindex <- df2$FinanceIndex[index]
df1$finindex[is.na(df1$finindex)] <- 0
For string matching, I found this function, which matches similar-sounding words:
library(RecordLinkage)
soundex('Nellore')==soundex('Vellore')
#FALSE
The output should be :
# df1
State_name District_name Value1 finindex
1 Maharashtra Nashik 5 0.20975
2 Andhra Pradesh Chittoor 3 0.12187
3 Bihar Madhepura 6 0.66128
4 Bihar Kishanganj 4 0.37155
5 West Bengal Howrah 4 0.00000
6 Gujarat Gandhinagar 3 0.00000
7 Gujarat Ahmadabad 2 0.54730
8 Assam Sivasagar 4 0.10918
Is there any way these two functions can be used together to solve the problem? Or any other way to solve the problem?
An option is to do a partial match with stringdist_left_join from fuzzyjoin:
library(dplyr)
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("District_name" = "Districts")) %>%
  select(-Districts)
# State_name District_name Value1 FinanceIndex
#1 Maharashtra Nashik 5 0.20975
#2 Andhra Pradesh Chittoor 3 0.12187
#3 Bihar Madhepura 6 0.66128
#4 Bihar Kishanganj 4 0.37155
#5 West Bengal Howrah 4 NA
#6 Gujarat Gandhinagar 3 NA
#7 Gujarat Ahmadabad 2 0.54730
#8 Assam Sivasagar 4 0.10918
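To also get zeros instead of NA for unmatched districts, as the desired output shows, you can stay in base R with adist() (generalized edit distance): match each district to its closest counterpart and treat anything farther than a small threshold as unmatched. A sketch; the threshold of 2 is an assumption tuned to this data, not a general rule:

```r
df1 <- data.frame(State_name = c("Maharashtra", "Andhra Pradesh", "Bihar", "Bihar",
                                 "West Bengal", "Gujarat", "Gujarat", "Assam"),
                  District_name = c("Nashik", "Chittoor", "Madhepura", "Kishanganj",
                                    "Howrah", "Gandhinagar", "Ahmadabad", "Sivasagar"),
                  Value1 = c(5, 3, 6, 4, 4, 3, 2, 4))
df2 <- data.frame(Districts = c("Nashik", "Chitoor", "Kishanganj", "Madhepur",
                                "Sibhasagar", "Ahmadabad"),
                  FinanceIndex = c(0.20975, 0.12187, 0.37155, 0.66128, 0.10918, 0.54730))

d    <- adist(df1$District_name, df2$Districts)  # edit-distance matrix, rows follow df1
best <- apply(d, 1, which.min)                   # index of the closest df2 district
best[apply(d, 1, min) > 2] <- NA                 # too far apart: treat as no match
df1$finindex <- ifelse(is.na(best), 0, df2$FinanceIndex[best])
```

This reproduces the desired output above: Howrah and Gandhinagar, which have no close counterpart in df2, get 0.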

remove individuals based on their range of values

I have a df with two variables, one with IDs and one with a variable called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding the people flagged by it. However, there must be a simpler, more elegant way to do this.
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add a binary indicator:
Data$all <- ifelse(Data$numbers == 1, 1, 0)
Data$allperson <- ave(Data$all, Data$names, FUN = cumsum)
Step 2 - get rid of the people who do not have 1 as their start number:
Data[!Data$allperson == 0, ]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
  group_by(names) %>%
  filter(min(numbers) == 1)
It means just what it appears to mean: keep only the records where the group (defined by names) has a minimum numbers value equal to 1, i.e. the people whose sequence starts at 1.
names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 pat 1
10 pat 2
11 pat 3
12 pat 4
13 pat 5
14 pat 6
15 pat 7
16 pat 8
17 pat 9
18 pat 10
You may also try, in base R (using the Data object read above):
zz1 <- Data[with(Data, names %in% unique(names)[!!table(Data)[, 1]]), ]
head(zz1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
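Another base R one-liner worth considering, sketched here on a rebuilt copy of the example data so it is self-contained: use ave() to broadcast each person's first number across their rows, and keep the groups whose first number is 1.

```r
# Compact reconstruction of the example data (same names and number ranges)
Data <- data.frame(
  names   = rep(c("john", "mary", "pat", "sue", "tom"), c(8, 9, 10, 8, 7)),
  numbers = c(1:8, 4:12, 1:10, 2:9, 5:11))

# ave() repeats each group's first value across the group, so the
# comparison keeps or drops each person's rows as a whole
Data[with(Data, ave(numbers, names, FUN = function(x) x[1]) == 1), ]
```

Comparing against the first value, rather than min(), also handles the case where a 1 appears later in a person's sequence.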

selecting rows of which the value of the variable is equal to certain vector

I have longitudinal data called df for more than 1,000 people that looks like the following:
id year name status
1 1984 James 4
1 1985 James 1
2 1983 John 2
2 1984 John 1
3 1980 Amy 2
3 1981 Amy 2
4 1930 Jane 4
4 1931 Jane 5
I'm trying to subset the data by certain ids. For instance, I have a vector dd that consists of the ids that I would like to keep:
dd<-c(1,3)
I've tried the following, but none of it worked. For instance:
subset<-subset(df, subset(df$id==dd))
or
subset<-subset(df, subset(unique(df$id))==dd))
or
subset<-df[which(unique(df$id)==dd),]
or I tried a for-loop
for (i in 1:2){
subset<-subset(df, subset=(unique(df$id)==dd[i]))
}
Would there be a way to select only the rows with ids that match the numbers in the vector dd?
Use %in% and logical indexing:
df[df$id %in% dd,]
id year name status
1 1 1984 James 4
2 1 1985 James 1
5 3 1980 Amy 2
6 3 1981 Amy 2
As an alternative you can use dplyr, a package (by Hadley Wickham) that provides a fast set of tools for efficiently manipulating datasets:
library(dplyr)
filter(df, id %in% dd)
id year name status
1 1 1984 James 4
2 1 1985 James 1
3 3 1980 Amy 2
4 3 1981 Amy 2
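Base R's subset() also accepts the same %in% condition, which avoids the explicit row-comma indexing; a sketch on the question's data:

```r
df <- data.frame(id     = c(1, 1, 2, 2, 3, 3, 4, 4),
                 year   = c(1984, 1985, 1983, 1984, 1980, 1981, 1930, 1931),
                 name   = c("James", "James", "John", "John", "Amy", "Amy", "Jane", "Jane"),
                 status = c(4, 1, 2, 1, 2, 2, 4, 5))
dd <- c(1, 3)

# %in% tests each id against the whole vector dd, unlike ==,
# which recycles dd element-by-element and silently misaligns
subset(df, id %in% dd)
```

This is why the `df$id == dd` attempts above failed: `==` recycles the two-element dd along the column instead of testing membership.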
