Calculating the mean words produced by a participant in R - r

I am working with R. I am working with a dataset and I need to calculate the average of words produced by participants within a group. My data looks like this.
Group Participants WORDS
A John table
A John chair
A John house
A Steph pool
A Steph backyard
A Carlos street
B Pedro stop
B Pedro basket
B Jeff dog
B Alan cat
B Alan river
C Steve ocean
C Steve boat
C Steve hammer
C Steve temperature
C Steve sun
C Bryan outlet
C Mark printer
C Kobe basket
C Kobe internet
C Kobe legend
C Kobe ball
So, for example... within the group A we have only three (3) participants and a total of six (6) words produced. Within that group we have an average of two (2) words produced by each participants.
My problem is that since I am not working with numbers, I don't know how to calculate this in R.

Try this:
library(dplyr)
dft %>%
group_by(Group) %>%
summarise(participants = length(unique(Participants)),
words = length(unique(WORDS)),
mean_words = words/participants)
or (using only dplyr functions)
library(dplyr)
dft %>%
group_by(Group) %>%
summarise(participants = n_distinct(Participants),
words = n_distinct(WORDS),
mean_words = words/participants)

Next time, please provide sample data which is copy-pastable (e.g. using dput).
I generated some sample data as
set.seed(5555)
df <- data.frame(Group=sample(c("A","B","C"),100,replace=TRUE),
Participant= sample(letters[1:6],100,replace=TRUE),
Words = sample(paste0(LETTERS,LETTERS),100,replace=TRUE), stringsAsFactors = FALSE)
Then you can do
groupStats <- lapply(split(df,df$Group), function(wordsAndParticipantsInGroup) {
participantCount <- length(unique(wordsAndParticipantsInGroup$Participant))
wordCount <- length(unique(wordsAndParticipantsInGroup$Words))
meanWords <- wordCount/participantCount
data.frame(participants=participantCount,meanWords = meanWords)
})
groupStats <- data.frame(Group=names(groupStats),do.call("rbind",groupStats))

Related

In R, left join two tables whose 2 potential keys contain missing data

Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate an logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d<- d %>%
mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0)
%>% as.logical())
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
Use a two pass method. First fill in the missing make (or VIN values). I'll illustrate by filling in make valuesDo notice taht "NA" is not the same as NA. The first is a character value while the latter is a true R missing value, so I'd first convert those to a true missing value. In natural language I am replacing the missing values in d (note correction of df) with values of 'jp_makes' that are taken from a on the basis of matching VIN_prefix values:
is.na( d$make) <- df$make=="NA"
d$make[is.na(df$make)] <- a$jp_makes[
match( d$VIN_prefix[is.na(d$make)], a$VIN_prefix) ]
Now you have the make values filled in on the basis of the table look up in a. It should be trivial to do the match you wanted by using by.x='make', by.y='jp_make'
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the values the =="NA" in VIN_prefix.x.

Summing matched values in two different dataframes

Extremely new to R and coding in general. My intuition is that this should have a very basic answer, so feel free to send me back to basic intro class if this is too basic to spend your time on.
To make things easier I will reduce my problem to a much more simple situation with the same salient features.
I have two dataframes. The first shows how many games some people played as "white". The second shows how many games some people payed as "black". Some players played both as white and black, some others played only in one of these roles.
I would like to merge these two dataframes into one showing all players who have played in either role and how many total games they played, whether as white or black.
A reproducible example:
player_as_white <- c('John', 'Max', 'Grace', 'Zoe', 'Peter')
games_white <- c(sample(1:20,5))
dat1 <- data.frame(player_as_white, games_white)
player_as_black <- c('John', 'Eddie', 'Zoe')
games_black <- c(sample(1:20, 3))
dat2 <- data.frame(player_as_black, games_black)
How do I get a consolidated dataset showing how many total games all 6 players have played, whether as white or black?
Thanks!
For reproducibility, it's good practice to specify a random seed so the example works the same each time you run it, and for others. I'd also suggest using stringsAsFactors = FALSE so that the names are treated as characters and not factors, which will make this a little simpler. (edit: But it should work fine here with the default, too.)
set.seed(0)
player_as_white <- c('John', 'Max', 'Grace', 'Zoe', 'Peter')
games_white <- c(sample(1:20,5))
dat1 <- data.frame(player_as_white, games_white, stringsAsFactors = FALSE)
player_as_black <- c('John', 'Eddie', 'Zoe')
games_black <- c(sample(1:20, 3))
dat2 <- data.frame(player_as_black, games_black, stringsAsFactors = FALSE)
Then we can use merge to combine the two:
merge(dat1, dat2, by.x = "player_as_white", by.y = "player_as_black", all = T)
# player_as_white games_white games_black
#1 Eddie NA 18
#2 Grace 7 NA
#3 John 18 5
#4 Max 6 NA
#5 Peter 15 NA
#6 Zoe 10 19
Or a dplyr solution, which keeps the order from dat1
library(dplyr)
full_join(dat1, dat2, by = c("player_as_white" = "player_as_black"))
# player_as_white games_white games_black
#1 John 18 5
#2 Max 6 NA
#3 Grace 7 NA
#4 Zoe 10 19
#5 Peter 15 NA
#6 Eddie NA 18

Sampling by group without repetition using data.table

I'll use a hypothetical scenario to illustrate the question. Here's a table with musicians and the instrument they play and a table with the composition for a band:
musicians <- data.table(
instrument = rep(c('bass','drums','guitar'), each = 4),
musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo','George','John','Paul','Ringo')
)
band.comp <- data.table(
instrument = c('bass','drums','guitar'),
n = c(2,1,2)
)
To avoid arguments about who is best with which instrument, the band will be assembled by sortition. Here's how I'm doing:
musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
instrument V1
1: bass Paul
2: bass Chas
3: drums Andy
4: guitar Paul
5: guitar George
The problem is: since there are musicians who play more than one instrument, it can happen that one person is drawn more than once.
One can build a for loop that, for each subsequent subset of instruments, draws musicians and then eliminates those from the rest of the table. But I would like suggestions on how to do this using data.table. Mainly because the kind of problem I need to solve in real life with this logic involves data bases with hundreds of thousands of rows. And also because I'm trying to better understand the data.table syntax.
As a reference, I tried some tips from Andrew Brooks blog, but couldn't come up with a solution.
This can be a solution, first you select an instrument by musician and then you select the musicians of your sample. But it may be that when selecting an instrument per musician your sample size is larger than the population then you will get an error (but in your real data this may not be a problem).
musicians[, .(instrument = sample(instrument, 1)), by = musician][band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
You could expand the band comp into sum(band.comp$n) distinct positions and keep sampling until you find a feasible composition:
roles = musicians[,
CJ(posn = 1:band.comp[.BY, on=.(instrument), x.n], musician = musician)
, by=instrument]
set.seed(1)
while (TRUE){
roles[sample(1:.N), keep := !duplicated(.SD, by="musician") & !duplicated(.SD, by=c("instrument", "posn"))][]
if (sum(roles$keep) == sum(band.comp$n)) break
}
setorder(roles[keep == TRUE, !"keep"])[]
instrument posn musician
1: bass 1 Stuart
2: bass 2 John
3: drums 1 Andy
4: guitar 1 George
5: guitar 2 Paul
There's probably something you could do with linear programming or a bipartite graph to answer the question of whether a feasible comp exists, but it's unclear what "sampling" even means in terms of the distribution over feasible comps.
Came across a relevant post: Randomly draw rows from dataframe based on unique values and column values and eddi's answer is perfect for this OP:
#keep number of musicians per instrument in 1 data.table
musicians[band.comp, n:=n, on=.(instrument)]
#for storing the musician that has been sampled so far
m <- c()
musicians[, {
#exclude sampled musician before sampling
res <- .SD[!musician %chin% m][sample(.N, n[1L])]
m <- c(m, res$musician)
res
}, by=.(instrument)]
sample output:
instrument musician n
1: bass Stuart 2
2: bass Chas 2
3: drums Paul 1
4: guitar John 2
5: guitar Ringo 2
Or more succinctly with error handling as well:
m <- c()
musicians[
band.comp,
on=.(instrument),
j={
s <- setdiff(musician, m)
if (length(s) < n) stop(paste("Not enough musicians playing", .BY))
res <- sample(s, n)
m <- c(m, res)
res
},
by=.EACHI]

R how to import a list with different number of columns to a data frame

I am trying to perform some scientometrics analysis from a Scopus csv file. The first column of the imported csv is like:
Authors,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Document Type,Source,EID
"Kuck, L.S., Noreña, C.P.Z.","Microencapsulation of grape (Vitis labrusca var. Bordo) skin phenolic extract using gum Arabic, polydextrose, and partially hydrolyzed guar gum as encapsulating agents",2016,"Food Chemistry","194",,,"569","576",,,10.1016/j.foodchem.2015.08.066,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940212199&partnerID=40&md5=e4c36e03156570a7fe31c2937b3a170d",Article,Scopus,2-s2.0-84940212199
"Grasel, F.D.S., Ferrão, M.F., Wolf, C.R.","Development of methodology for identification the nature of the polyphenolic extracts by FTIR associated with multivariate analysis",2016,"Spectrochimica Acta - Part A: Molecular and Biomolecular Spectroscopy","153",,,"94","101",,,10.1016/j.saa.2015.08.020,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939865445&partnerID=40&md5=8239487f4eea9479d698792e6aa348de",Article,Scopus,2-s2.0-84939865445
"De Souza, D., Sbardelotto, A.F., Ziegler, D.R., Marczak, L.D.F., Tessaro, I.C.","Characterization of rice starch and protein obtained by a fast alkaline extraction method",2016,"Food Chemistry","191",, 17279,"36","44",,,10.1016/j.foodchem.2015.03.032,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84938952690&partnerID=40&md5=989cbfcc72286a87f726925732db4b49",Article,Scopus,2-s2.0-84938952690
"Filho, P.R.M., Vercelino, R., Cioato, S.G., Medeiros, L.F., de Oliveira, C., Scarabelot, V.L., Souza, A., Rozisky, J.R., Quevedo, A.S., Adachi, L.N.S., Sanches, P.R.S., Fregni, F., Caumo, W., Torres, I.L.S.","Transcranial direct current stimulation (tDCS) reverts behavioral alterations and brainstem BDNF level increase induced by neuropathic pain model: Long-lasting effect",2016,"Progress in Neuro-Psychopharmacology and Biological Psychiatry","64",,,"44","51",,,10.1016/j.pnpbp.2015.06.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84937468588&partnerID=40&md5=b03f0ccfbf66a49a438c9983cc2e8f9d",Article,Scopus,2-s2.0-84937468588
"Duarte, Á.T., Borges, A.R., Zmozinski, A.V., Dessuy, M.B., Welz, B., De Andrade, J.B., Vale, M.G.R.","Determination of lead in biomass and products of the pyrolysis process by direct solid or liquid sample analysis using HR-CS GF AAS",2016,"Talanta","146",,,"166","174",,,10.1016/j.talanta.2015.08.041,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940416990&partnerID=40&md5=55d7ddad27e955b9b6e269469e29c8c3",Article,Scopus,2-s2.0-84940416990
"Francischini, H., Paes Neto, V.D., Martinelli, A.G., Pereira, V.P., Marinho, T.S., Teixeira, V.P.A., Ferraz, M.L.F., Soares, M.B., Schultz, C.L.","Invertebrate traces in pseudo-coprolites from the upper Cretaceous Marília Formation (Bauru Group), Minas Gerais State, Brazil",2016,"Cretaceous Research","57",,,"29","39",,,10.1016/j.cretres.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939175950&partnerID=40&md5=b049de15a08ba477cc189d7e8fe7f0a3",Article,Scopus,2-s2.0-84939175950
"Bonfatti, B.R., Hartemink, A.E., Giasson, E., Tornquist, C.G., Adhikari, K.","Digital mapping of soil carbon in a viticultural region of Southern Brazil",2016,"Geoderma","261",,,"204","221",,,10.1016/j.geoderma.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939499978&partnerID=40&md5=b470166e01648dcbe8f0d43be86c84e0",Article,Scopus,2-s2.0-84939499978
"Scaramuzza dos Santos, T.C., Holanda, E.C., de Souza, V., Guerra-Sommer, M., Manfroi, J., Uhl, D., Jasper, A.","Evidence of palaeo-wildfire from the upper Lower Cretaceous (Serra do Tucano Formation, Aptian-Albian) of Roraima (North Brazil)",2016,"Cretaceous Research","57",,,"46","49",,,10.1016/j.cretres.2015.08.003,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939615367&partnerID=40&md5=e59f5130c6a2e1863f9aa77c960e6462",Article,Scopus,2-s2.0-84939615367
"da Silva, S.W., Bortolozzi, J.P., Banús, E.D., Bernardes, A.M., Ulla, M.A.","TiO<inf>2</inf> thick films supported on stainless steel foams and their photoactivity in the nonylphenol ethoxylate mineralization",2016,"Chemical Engineering Journal","283",, 14049,"1264","1272",,,10.1016/j.cej.2015.08.057,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940747062&partnerID=40&md5=aebc7357f9dedaadeebabfeda4aa3dd9",Article,Scopus,2-s2.0-84940747062
"Dalmora, A.C., Ramos, C.G., Oliveira, M.L.S., Teixeira, E.C., Kautzmann, R.M., Taffarel, S.R., de Brum, I.A.S., Silva, L.F.O.","Chemical characterization, nano-particle mineralogy and particle size distribution of basalt dust wastes",2016,"Science of the Total Environment","539",, 18331,"560","565",,,10.1016/j.scitotenv.2015.08.141,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84941754626&partnerID=40&md5=1c4ca1a3486ff55f92f238083af3eb50",Article,Scopus,2-s2.0-84941754626
"Fink, J.R., Inda, A.V., Bavaresco, J., Barrón, V., Torrent, J., Bayer, C.","Adsorption and desorption of phosphorus in subtropical soils as affected by management system and mineralogy",2016,"Soil and Tillage Research","155",,,"62","68",,,10.1016/j.still.2015.07.017,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940195225&partnerID=40&md5=2e43a874f1e36f11aa5efa057ce660b9",Article,Scopus,2-s2.0-84940195225
"Martins, A.B., Santana, R.M.C.","Effect of carboxylic acids as compatibilizer agent on mechanical properties of thermoplastic starch and polypropylene blends",2016,"Carbohydrate Polymers","135",,,"79","85",,,10.1016/j.carbpol.2015.08.074,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940781718&partnerID=40&md5=426e62c6c0de33a91bdb2f75442fbd6f",Article,Scopus,2-s2.0-84940781718
In each line, there are a variable number of authors (up to more than 20). Until now, I am doing something like:
test <- read.csv("test.csv")
test$Authors <- as.character(test$Authors)
test2 <- strsplit(as.character(test$Authors), '.,', fixed=TRUE)
Which gives me a list correctly separating each author. I tested several alternatives proposed in the list to move the list to a data frame, but the closer one was:
test3 <- str_split_fixed(test$Authors, '.,', n = 20)
Which gave me two problems:
1) I have to define the number of columns, which I do not know before analyzing the data;
2) the authors are not properly separated, but surname and abbreviated names are in different columns. Additionally, the command removed some characters from the names.
Some of the strategies suggested elsewhere gave me the correct separation of authors in columns, but the empty columns were fulfilled by repeating the initial names. Sorry if the question is too naive, I am beginning in the use of R.
Any advises and or insights would be greatly appreciated!
Here's how I would do it.
Firstly, using read.csv is causing the split between Authors last name and initial, so I'm using readLines instead.
Secondly, having "wide data" like this is not in general a good idea. It makes data more difficult to work with in subsequent analyses. For that reason, I've made it "long" instead.
n1 <- readLines(con="test.csv")
n1 <- strsplit(n1, '., ', fixed=TRUE)
n1 <- do.call(rbind, lapply(1:length(n1), function(x){data.frame(aut = n1[[x]], pub = x, order = 1:length(n1[[x]]))}))
n1$aut <- gsub("\\.$", "", n1$aut)
Here's the output:
aut pub order
1 Kuck, L.S 1 1
2 Noreña, C.P.Z 1 2
3 Grasel, F.D.S 2 1
4 Ferrão, M.F 2 2
5 Wolf, C.R 2 3
6 Abreu, M.S 3 1
7 Giacomini, A.C.V 3 2
8 Gusso, D 3 3
9 Rosa, J.G.S 3 4
NB if you really want your data in "wide format", we can easily reshape it:
library(tidyr)
spread(n1, order, aut)
pub 1 2 3 4
1 1 Kuck, L.S Noreña, C.P.Z <NA> <NA>
2 2 Grasel, F.D.S Ferrão, M.F Wolf, C.R <NA>
3 3 Abreu, M.S Giacomini, A.C.V Gusso, D Rosa, J.G.S
EDIT: for your full version, you need to use read.csv:
input <- n1 <- read.csv("test.csv")
n1$Authors <- as.character(n1$Authors)
n1$Authors <- strsplit(n1$Authors, '., ', fixed=TRUE)
n1 <- do.call(rbind, lapply(1:length(n1$Authors), function(x){data.frame(aut = n1$Authors[[x]], pub = x, order = 1:length(n1$Authors[[x]]))}))
n1$aut <- gsub("\\.$", "", n1$aut)
If you want to go back to wide with all your stuff:
library(dplyr)
library(tidyr)
input <- mutate(input, row = row_number())
n1 %>% spread(order, aut) %>%
left_join(input, by = c("pub" = "row")) %>%
select(-Authors)

Converting scraped R data using readHTMLTable()

I'm trying to scrape this website http://www.hockeyfights.com/fightlog/ but having hard time putting the into a nice data frame. So far I have this:
> asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
> asdf.asdf <- readHTMLTable(asdf)
Then I get this giant list. How do I convert this into a 2 column dataframe that has only player names (who were in a fight) with n rows (number of fights)?
Thanks for your help in advance.
Is this the output you're after?
require(RCurl); require(XML)
asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
asdf.asdf <- readHTMLTable(asdf)
First, make a table of each player and the count of fights they've been in...
# get variable with player names
one <- as.character(na.omit(asdf.asdf[[1]]$V3))
# get counts of how many times each name appears
two <- data.frame(table(one))
# remove non-name data
three <- two[two$one != 'Away / Home Player',]
# check
head(three)
one Freq
1 Aaron Volpatti 1
3 Brandon Bollig 1
4 Brian Boyle 1
5 Brian McGrattan 1
6 Chris Neil 2
7 Colin Greening 1
Second, make a table of who is in each fight...
# make data frame of pairs by subsetting the vector of names
four <- data.frame(away = one[seq(2, length(one), 3)],
home = one[seq(3, length(one), 3)])
# check
head(four)
away home
1 Brian Boyle Zdeno Chara
2 Tom Sestito Chris Neil
3 Dale Weise Mark Borowiecki
4 Brandon Bollig Brian McGrattan
5 Scott Hartnell Eric Brewer
6 Colin Greening Aaron Volpatti

Resources