R string parsing challenge

I am dealing with a column that contains strings as follows
Col1
------------------------------------------------------------------
Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery
What I am trying to do is separate the strings starting with either Department, Division, or Center, splitting at each comma (,). The final output should look like this:
Dept_Mechanical_Eng Dept_Computer_Science Div_Adv_Machining Cntr_Mining_Metallurgy Dept_Aerospace Cntr_Science_Delivery
                  1                     1                 0                      0              0                     0
                  0                     0                 1                      1              0                     0
                  0                     0                 0                      0              1                     1
I have butchered the actual names purely for aesthetic purposes in the expected output. Any help on parsing this string is much appreciated.

This is very similar to a question I just answered about tabulating another text example. Are you in the same class as the questioner there? See: Count the number of times (frequency) a string occurs
inp <- "Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery"
inp2 <- factor(scan(text = inp, what = "", sep = ","))
#Read 6 items
inp3 <- readLines(textConnection(inp))
as.data.frame(setNames(lapply(levels(inp2),
                              function(ll) as.numeric(grepl(ll, inp3))),
                       trimws(levels(inp2))))
  Department.of.Aerospace Division.of.Advanced.Machining Center.for.Mining.and.Metallurgy Center.for.Science.and.Delivery Department.of.Computer.Science Department.of.Mechanical.Engineering
1                       0                              0                                0                               0                              1                                    1
2                       0                              1                                1                               0                              0                                    0
3                       1                              0                                0                               1                              0                                    0
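If you also want the shortened column names shown in the expected output (Dept_..., Div_..., Cntr_...), a renaming pass can follow the answer above. This is only a sketch: the abbrev helper and its substitution rules are my own for illustration, and it will not shorten individual words such as Advanced to Adv:
res <- as.data.frame(setNames(lapply(levels(inp2),
                                     function(ll) as.numeric(grepl(ll, inp3))),
                              trimws(levels(inp2))))
# Hypothetical helper: replace the institutional prefixes and drop filler words
abbrev <- function(x) {
  x <- gsub("^Department\\.of\\.", "Dept_", x)
  x <- gsub("^Division\\.of\\.",   "Div_",  x)
  x <- gsub("^Center\\.for\\.",    "Cntr_", x)
  gsub("\\.(and|of|for)\\.|\\.", "_", x)   # ".and."/".of."/".for." and remaining dots -> "_"
}
names(res) <- abbrev(names(res))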

Related

How do I make a selected table confined to a matrix, rather than a running list?

My previous lines of code for making tables from column names successfully produced short, dense matrices that let me readily process survey data for two questions: (2nd example).
However, when I try using the same line of code for a new column (above), I don't get that sleek matrix; I end up with a running list of unlinked sub-tables, which I do not want. Perhaps it's because the new column only has 0's and 1's as numeric values, versus the others that have more than two: (1st example).
[Please forgive my formatting issues (StackOverflow Status: Newbie). Also, many thanks in advance to those checking in on and answering my question!]
> table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))

Relationship 2Affected Individual 1
  1                               0
  2                               0
  3                               0
  6                               0
  Other (please specify)          0

, , 1 = 1, Response = 10679308122

Relationship 2Affected Individual 1
  1                               0
  2                               0
  3                               0
  6                               0
  Other (please specify)          0

, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
                                 Indirect_Benefits
Relationship 2Affected Individual  0  1 2 3
  1                                4  1 0 0
  2                               42 17 9 3
  3                               12  1 1 0
  6                                5  2 2 0
  Other (please specify)           1  0 0 0
> #rstudioapi::versionInfo()
> #packageVersion("dplyr")
table(data_final$`Relationship 2Affected Individual`, data_final$Satisfied_Treatments)
Problem solved^ (calling table() directly on the two vectors, with backticks around the column name that contains spaces, returns the compact two-way matrix).
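For anyone hitting the same wall: table() crosses every column it is given, so when select() hands it more than two columns the result is a multi-way table, which prints as that running list of sub-tables. A minimal sketch with made-up data:
df <- data.frame(a = c(1, 1, 2), b = c(0, 1, 0), c = c("x", "y", "x"))
table(df[, c("a", "b")])   # two columns   -> one compact two-way matrix
table(df)                  # three columns -> one sub-table per level of c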

Counting frequencies of several groups at the same time

I am tracking the handling of many research fields in peer-reviewed literature, and have processed almost 1500 papers. In my data file I have columns for 25 topics, annotated in the data file with 1 (presence) and 0 (absence). The data file schematically looks like this:
TITLE AUTHORS JOURNAL YEAR ... TOPIC1 TOPIC2 TOPIC3 TOPIC4 ... TOPIC25
'xxx' 'yyy'   'zzz'   2002       1      0      0      1          0
'xxx' 'yyy'   'zzz'   2012       0      0      0      0          1
'xxx' 'yyy'   'zzz'   2002       0      0      1      1          0
'xxx' 'yyy'   'zzz'   2015       1      0      0      0          0
'xxx' 'yyy'   'zzz'   2015       0      0      0      0          0
'xxx' 'yyy'   'zzz'   2013       0      0      1      1          1
'xxx' 'yyy'   'zzz'   2012       1      0      0      0          0
'xxx' 'yyy'   'zzz'   2012       0      0      1      0          1
I need to count the frequencies of various topics in the papers and end up with a data frame looking like this:
TOPIC count
TOPIC1 7
TOPIC2 19
.
.
TOPIC25 15
I've been googling, reading, and trying a few different things, but nothing has worked so far, hence no code posted.
Any help greatly appreciated...
We can loop over the columns of interest, get the sum, and stack it to create a two-column 'data.frame':
res <- setNames(stack(lapply(df1[grep("^TOPIC\\d+", names(df1))], sum))[2:1],
                c("TOPIC", "count"))
head(res,2)
# TOPIC count
#1 TOPIC1 7
#2 TOPIC2 19
If the column names don't share a pattern, use the column index to subset the columns, i.e. supposing POPABU is the 5th column and POPGEN is the last column:
res <- setNames(stack(lapply(df1[5:ncol(df1)], sum))[2:1],
                c("TOPIC", "count"))
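An equivalent route is colSums(), which sums every topic column in a single call; this sketch assumes the topic columns are named TOPIC1 ... TOPIC25 as in the example:
topic_cols <- grep("^TOPIC\\d+$", names(df1))
res <- data.frame(TOPIC = names(df1)[topic_cols],
                  count = colSums(df1[topic_cols]),
                  row.names = NULL)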

Frequency Distribution Plot of Document Term Matrix

I have created a document term matrix that looks something like this:
inspect(dtm[1:4,1:6])
         allowed allowing almost alone companyunder companywide
Doc1.txt       1        1      1     0            1           0
Doc2.txt       0        1      1     0            1           1
Doc3.txt       0        0      0     1            0           1
Doc4.txt       1        0      1     0            1           1
Taking its column sums gives me:
colSums(dtm)
allowed 2
allowing 2
almost 3
alone 1
companyunder 3
companywide 3
This essentially indicates how many documents each word is found in (e.g. allowed 2 tells me that allowed is found in two documents).
I'm having difficulty creating a frequency distribution plot with the document number on the x-axis and the number of words that document contains on the y-axis.
Is this what you're looking for?
# Rebuild the example document-term matrix
dtm <- array(c(1,0,0,1, 1,1,0,0, 1,1,0,1, 0,0,1,0, 1,1,0,1, 0,1,1,1),
             dim = c(4, 6))
dimnames(dtm) <- list(c("Doc1", "Doc2", "Doc3", "Doc4"),
                      c("allowed", "allowing", "almost", "alone",
                        "companyunder", "companywide"))
print(dtm)
# Document index on the x-axis, number of terms in that document on the y-axis
plot(rowSums(dtm))
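If a bar chart reads better than the default scatter of points, a labelled barplot is a small change (a sketch using the same toy dtm):
barplot(rowSums(dtm),
        xlab = "Document", ylab = "Number of terms",
        main = "Terms per document")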

R Team Roster Optimization w/ lpSolve

I am new to R and have a particular fantasy sports team optimization problem I would like to solve. I have seen other posts use lpSolve for similar problems, but I cannot seem to wrap my head around the code. An example data table is below. Every player is on a team, plays a particular role, has a salary, and has average points produced per game. The constraints I need are: exactly 8 players; no more than 3 players from any one team; at least one player for each of the 5 roles; and cumulative salary not exceeding $10,000.
Team Player Role Avgpts Salary
Bears A T 22 930
Bears B M 19 900
Bears C B 30 1300
Bears D J 25 970
Bears E S 20 910
Jets F T 21 920
Jets G M 26 980
[...]
In R, I write in the following
> obj = DF$AVGPTS
> con = rbind(t(model.matrix(~ Role + 0, DF)), rep(1,nrow(DF)), DF$Salary)
> dir = c(">=",">=",">=",">=",">=","==","<=")
> rhs = c(1,1,1,1,1,8,10000)
> result = lp("max", obj, con, dir, rhs, all.bin = TRUE)
This code works fine in producing the optimal fantasy team without the limit that no more than 3 players may come from any one team. This is where I am stuck, and I suspect it relates to the con argument. Any help is appreciated.
What if you added something similar to the way you did the roles to con?
If you add t(model.matrix(~ Team + 0, DF)) you'll have indicators for each team in your constraint. For the example you gave:
> con <- rbind(t(model.matrix(~ Role + 0,DF)), t(model.matrix(~ Team + 0, DF)), rep(1,nrow(DF)), DF$Salary)
> con
            1   2    3   4   5   6   7
RoleB       0   0    1   0   0   0   0
RoleJ       0   0    0   1   0   0   0
RoleM       0   1    0   0   0   0   1
RoleS       0   0    0   0   1   0   0
RoleT       1   0    0   0   0   1   0
TeamBears   1   1    1   1   1   0   0
TeamJets    0   0    0   0   0   1   1
            1   1    1   1   1   1   1
          930 900 1300 970 910 920 980
We now need to update dir and rhs to account for this:
dir <- c(">=",">=",">=",">=",">=", rep("<=", n_teams), "==", "<=")
rhs <- c(1,1,1,1,1, rep(3, n_teams), 8, 10000)
With n_teams set appropriately (the "==" keeps your original exactly-8-players constraint).
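Putting the whole model together, a sketch might look like this. It assumes DF carries the columns from the example table (Team, Player, Role, Avgpts, Salary; the question's code uses DF$AVGPTS, so match whatever casing your data actually has) and that the lpSolve package is installed:
library(lpSolve)

n_teams <- length(unique(DF$Team))
obj <- DF$Avgpts                                  # points to maximize
con <- rbind(t(model.matrix(~ Role + 0, DF)),     # at least one of each of the 5 roles
             t(model.matrix(~ Team + 0, DF)),     # at most 3 players per team
             rep(1, nrow(DF)),                    # exactly 8 players
             DF$Salary)                           # salary cap
dir <- c(rep(">=", 5), rep("<=", n_teams), "==", "<=")
rhs <- c(rep(1, 5), rep(3, n_teams), 8, 10000)
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
DF[result$solution == 1, ]                        # the selected roster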

How can I calculate an inner product with an arbitrary number of columns using ddply?

I want to perform an inner product of the first D columns for each row in a data frame with a given array, W. I am trying the following:
W = c(1, 2, 3);
ddply(df, .(id), transform, inner_product = c(col1, col2, col3) %*% W);
This works but I typically may have an arbitrary number of columns. Can I generalize the above expression to handle that case?
Update:
This is an updated example as asked for in the comments:
library(kernlab);
data(spam);
W = array();
W[1:3] = seq(1,3);
spamdf = head(spam);
spamdf$id = seq(1,nrow(spamdf));
df_out=ddply(spamdf, .(id), transform, inner_product=c(make, address, all) %*% W);
> W
[1] 1 2 3
> spamdf[1,]
  make address  all num3d  our ... type id
1    0    0.64 0.64     0 0.32 ... spam  1
> df_out[1,]
  make address  all num3d  our ... type id inner_product
1    0    0.64 0.64     0 0.32 ... spam  1           3.2
The above example performs an inner product of the first three dimensions of the spam data set (available in the kernlab package) with an array W = (1,2,3). Here I have explicitly specified the first three dimensions as c(make, address, all).
Thus df_out[1,"inner_product"] = 3.2.
Instead, I want to perform the inner product over all the dimensions without having to list them all. Would converting to a matrix and back to a data frame be an expensive operation?
A strategy along the lines of the following should work:
Convert each chunk to a matrix
Perform a matrix multiplication
Convert results to data.frame
The code:
set.seed(1)
df <- data.frame(
  id   = sample(1:5, 20, replace = TRUE),
  col1 = runif(20),
  col2 = runif(20),
  col3 = runif(20),
  col4 = runif(20)
)
W <- c(1, 2, 3, 4)
ddply(df, .(id), function(x) as.data.frame(as.matrix(x[, -1]) %*% W))
The results:
id V1
1 1 4.924994
2 1 5.076043
3 2 7.053864
4 2 5.237132
5 2 6.307620
6 2 3.413056
7 2 5.182214
8 2 7.623164
9 3 5.194714
10 3 6.733229
11 4 4.122548
12 4 3.569013
13 4 4.978939
14 4 5.513444
15 4 5.840900
16 4 6.526522
17 5 3.530220
18 5 3.549646
19 5 4.340173
20 5 3.955517
If you want to append a column of cross-products, you could do this (assuming W has the right number of elements to match the non-"id" columns):
df2 <- cbind(df, as.matrix(df[, -grep("id", names(df))]) %*% W )
It does not appear that the .(id) serves any useful purpose, since you are not doing a sum of crossproducts within id; and if you were, you wouldn't be using transform but some other aggregating function.
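If the grouping really is unnecessary, the same cross-product can be written without ddply while staying general over however many non-id columns there are (a sketch; it assumes W is ordered to match those columns):
num_cols <- setdiff(names(df), "id")
df$inner_product <- as.vector(as.matrix(df[num_cols]) %*% W)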
