Counting occurrences of strings across multiple columns in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I have a dataset in R which looks like the following (only relevant columns shown). It has sex-disaggregated data on which crops respondents wanted more information about and how much of a priority each crop was for them.
sex wantcropinfo1 priority1 wantcropinfo2 priority2
m wheat high eggplant medium
m rice low cabbage high
m rice high
f eggplant medium
f cotton low
...
I want to be able to (a) count the total occurrences of each crop across all the wantcropinfoX columns; (b) get the same counts disaggregated by sex; and (c) break the counts down further by both priority and sex.
(a) output should look like this:
crop count
wheat 1
eggplant 2
rice 2
...
(b) output should look like this:
crop countm countf
wheat 1 0
eggplant 1 1
rice 2 0
...
(c) should look like this:
crop high_m med_m low_m high_f med_f low_f
wheat 1 0 0 0 0 0
eggplant 0 1 0 0 1 0
rice 1 0 1 0 0 0
...
I'm a bit of an R newbie and the manuals are slightly bewildering. I've googled a lot but couldn't find anything that was quite like this even though it seems like a fairly common thing one might want to do. Similar questions on stackoverflow seemed to be asking something a bit different.

We can use melt from data.table to convert from 'wide' to 'long' format. It can take multiple measure columns.
library(data.table)
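# A hypothetical reconstruction of df1 from the question's table
# (an assumption: the blank cells are stored as empty strings):
df1 <- data.frame(sex = c("m", "m", "m", "f", "f"),
                  wantcropinfo1 = c("wheat", "rice", "rice", "eggplant", "cotton"),
                  priority1 = c("high", "low", "high", "medium", "low"),
                  wantcropinfo2 = c("eggplant", "cabbage", "", "", ""),
                  priority2 = c("medium", "high", "", "", ""))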
dM <- melt(setDT(df1), measure = patterns("^want", "priority"),
           value.name = c("crop", "priority"))[crop != '']
In the 'long' format, we get the three expected results either by grouping by 'crop' and counting the rows, or by converting back to 'wide' with dcast, specifying length as the fun.aggregate.
dM[,.(count= .N) , crop]
# crop count
#1: wheat 1
#2: rice 2
#3: eggplant 2
#4: cotton 1
#5: cabbage 1
dcast(dM, crop~sex, value.var='sex', length)
# crop f m
#1: cabbage 0 1
#2: cotton 1 0
#3: eggplant 1 1
#4: rice 0 2
#5: wheat 0 1
dcast(dM, crop~priority+sex, value.var='priority', length)
# crop high_m low_f low_m medium_f medium_m
#1: cabbage 1 0 0 0 0
#2: cotton 0 1 0 0 0
#3: eggplant 0 0 0 1 1
#4: rice 1 0 1 0 0
#5: wheat 1 0 0 0 0

Use the ddply function in the plyr package.
The structure of a call to this function is the following:
ddply(dataframe, .(var1, var2, ...), summarize, function)
In this case you might do the following:
a) ddply(df, .(wantcropinfo1), summarize, count = length(wantcropinfo1))
b) ddply(df, .(wantcropinfo1, priority1), summarize, count = length(wantcropinfo1))
c) ddply(df, .(wantcropinfo1, priority1, sex), summarize, count = length(wantcropinfo1))
Note that the output will not have the same structure you mention in your question, but the information will be the same. For the mentioned structure, use the table function.
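For instance, a base-R sketch of the table approach (an assumption: the wide data frame df from the question stores blanks as empty strings). Stack the two crop/priority column pairs, then tabulate:
crop <- c(df$wantcropinfo1, df$wantcropinfo2)
priority <- c(df$priority1, df$priority2)
sex <- rep(df$sex, 2)
keep <- crop != ''                              # drop the empty cells
table(crop[keep], sex[keep])                    # counts by crop and sex
table(crop[keep], paste(priority[keep], sex[keep], sep = "_"))  # by priority and sex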

Related

Cavs vs. Warriors - probability of Cavs winning the series includes combinations like "0,1,0,0,0,1,1" - but the series is over after game 5

There is a problem in DataCamp about computing the probability of winning an NBA series. The Cavs and the Warriors are playing a seven-game championship series. The first to win four games wins the series. They each have a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computed the probability using Monte Carlo simulation:
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when a game is lost, 1 when won
  sum(x) >= 4
})
mean(results)
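As a quick sanity check (my addition, not part of the DataCamp material), the exact binomial probability of winning at least 4 of the 6 remaining games agrees with the simulation:
sum(dbinom(4:6, 6, 0.5))  # (15 + 6 + 1)/64 = 0.34375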
Here is a different way they computed the probability using simple code:
# Assign a variable 'n' as the number of remaining games.
n<-6
# Assign a variable `outcomes` as a vector of possible game outcomes: 0 indicates a loss and 1 a win for the Cavs.
outcomes<-c(0,1)
# Assign a variable `l` to a list of all possible outcomes in all remaining games. Use the `rep` function on `list(outcomes)` to create list of length `n`.
l<-rep(list(outcomes),n)
# Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
possibilities<-expand.grid(l) # My comment: note how this produces 64 combinations.
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results<-rowSums(possibilities)>=4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
Question/Problem:
They both produce approximately the same probability of winning the series, ~0.34. However, there seems to be a flaw in the concept and the code design. For example, the code (sampling six times) allows for combinations such as the following:
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for games G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for games G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
As you can see, these sequences are never possible. After winning the first four of the remaining six games, no more games would be played. Similarly, after losing the first three of the remaining six games (which, together with the game-1 loss, makes four losses), no more games would be played. So these combinations shouldn't be included in the computation of the probability of winning the series. There is double counting for some of the combinations.
Here is what I did to omit some of the combinations that are not possible in real life.
library(dplyr)  # needed for %>%, mutate() and filter()
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities <- possibilities %>% mutate(rowsums = rowSums(possibilities)) %>% filter(rowsums <= 4)
But then I am not able to omit the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. This is because no more games will be played after losing three times in a row (with the game-1 loss, that is four losses). They are basically double counting.
There are too many conditions for me to be able to filter them individually. There has to be a more efficient and intuitive way to do this. Can someone provide me with some hints on how to solve this whole mess?
Here is a way:
library(dplyr)
outcomes<-c(0,1)
l<-rep(list(outcomes),6)
possibilities<-expand.grid(l)
possibilities %>%
  mutate(rowsums = rowSums(cur_data()),
         anti_sum = rowSums(!cur_data())) %>%
  filter(rowsums <= 4, anti_sum <= 3)
We use the fact that R coerces numbers into logicals, where 0 becomes FALSE (so !0 is TRUE). See sum(!0) as a short example.
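For instance:
sum(!c(1, 0, 0, 1))  # !x turns the 0s into TRUE, so this counts the losses: 2
Here anti_sum counts the losses in each row, just as rowsums counts the wins.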

Aggregate a value by 2 variables

I have a dataframe that looks something like this
AgeBracket No of People No of Jobs
18-25 2 5
18-25 2 2
26-34 4 6
35-44 4 0
26-34 2 3
35-44 1 7
45-54 3 2
From this I want to aggregate the data so it looks like the following:
AgeBracket 1Person 2People 3People 4People
18-25 0 3.5 0 0
26-34 0 3 0 6
35-44 7 0 0 0
45-54 0 0 2 0
So along the Y axis is the age bracket and along the X axis (top row) is the number of people, while the cells show the average number of jobs for that age bracket and number of people.
I assume it's something to do with aggregation but can't find anything similar to this on any site.
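(For a reproducible sketch, the data might be built like this; the short column names People and Jobs are assumptions that match the first answer's code.)
df <- data.frame(AgeBracket = c("18-25", "18-25", "26-34", "35-44", "26-34", "35-44", "45-54"),
                 People = c(2, 2, 4, 4, 2, 1, 3),
                 Jobs = c(5, 2, 6, 0, 3, 7, 2))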
Here is a data.table method using dcast.
library(data.table)
setnames(dcast(df, AgeBracket ~ People, value.var = "Jobs", fun.aggregate = mean, fill = 0),
         c("AgeBracket", paste0(sort(unique(df$People)), "Person")))[]
Here, dcast reshapes wide, putting persons as separate variables. fun.aggregate calculates the mean number of jobs within each AgeBracket-person cell, and fill = 0 puts a 0 in combinations that never occur (instead of NaN). setnames renames the variables, since the defaults are the bare integer values, and the [] at the end prints the result.
AgeBracket 1Person 2Person 3Person 4Person
1: 18-25 0 3.5 0 0
2: 26-34 0 3.0 0 6
3: 35-44 7 0.0 0 0
4: 45-54 0 0.0 2 0
This can be stretched out into two lines, which is probably more readable.
# reshape wide and calculate means
df.wide <- dcast(df, AgeBracket ~ People, value.var="Jobs", fun.aggregate=mean, fill=0)
# rename variables
setnames(df.wide, c("AgeBracket", paste0(names(df.wide)[-1], "Person")))
Assuming df is your data.frame, you can use aggregate with the mean function in base R, though the data.table way is faster, as suggested by @Imo:
agg <- aggregate(No.of.Jobs ~ AgeBracket + No.of.People,data=df,mean)
fin <- reshape2::dcast(agg,AgeBracket ~ No.of.People)
fin[is.na(fin)] <- 0
names(fin) <- c("AgeBracket",paste0("People",1:4))
As suggested by @Imo, a one-liner could be this:
reshape2::dcast(df, AgeBracket ~ No.of.People, value.var="No.of.Jobs", fun.aggregate=mean, fill=0)
We just need to rename the columns after that.
Output:
AgeBracket People1 People2 People3 People4
1 18-25 0 3.5 0 0
2 26-34 0 3.0 0 6
3 35-44 7 0.0 0 0
4 45-54 0 0.0 2 0

making a table with multiple columns in r

I'm obviously a novice at writing R code.
I have tried multiple solutions to my problem from Stack Overflow but I'm still stuck.
My dataset is carcinoid, patients with a small bowel cancer, with multiple variables.
I would like to know how different variables are distributed:
carcinoid$met_any - with metastatic disease 1=yes, 2=no(computed variable)
carcinoid$liver_mets_y_n - liver metastases 1=yes, 2=no
carcinoid$regional_lymph_nodes_y_n - regional lymph nodes 1=yes, 2=no
carcinoid$peritoneal_carcinosis_y_n - peritoneal carcinosis 1=yes, 2=no
I have tried this solution, which is close to my desired result:
ddply(carcinoid, .(carcinoid$met_any), summarize,
      livermetastases = sum(carcinoid$liver_mets_y_n == "1"),
      regionalmets = sum(carcinoid$regional_lymph_nodes_y_n == "1"),
      pc = sum(carcinoid$peritoneal_carcinosis_y_n == "1"))
with the result being:
carcinoid$met_any livermetastases regionalmets pc
1 1 21 46 7
2 2 21 46 7
Now, I expected the row with 2 (= no metastases) to be empty. I would also like the rows in the carcinoid$met_any column to give the number of patients.
If someone could help me it would be very much appreciated!
John
Edit
My dataset (in the full data the column numbers are 1, 43, 28, 31, 33):
1=yes, 2=no
case_nr met_any liver_mets_y_n regional_lymph_nodes_y_n pc
1 1 1 1 2
2 1 2 1 2
3 2 2 2 2
4 1 2 1 1
5 1 2 1 1
Desired output: I want to count the numbers of 1s and 2s; if it works, all 1s should end up in the met_any=1 row.
nr liver_mets regional_lymph_nodes pc
met_any=1 4 1 4 2
met_any=2 1 4 1 3
EDIT
Although I probably was very unclear in my question, with your help I could make the table I needed!
setDT(carcinoid)[,lapply(.SD,table),.SDcols=c(43,28,31,33,17)]
gives
met_any lymph_nod liver_met paraortal extrahep
1: 50 46 21 6 15
2: 111 115 140 151 146
I am very grateful! @mtoto provided the solution.
John
Based on your example data, this data.table approach works:
library(data.table)
setDT(df)[,lapply(.SD,table),.SDcols=c(2:5)]
# met_any liver_mets_y_n regional_lymph_nodes_y_n pc
# 1: 4 1 4 2
# 2: 1 4 1 3
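For reference, df here is the five-row example from the question's edit; a sketch to rebuild it:
df <- data.frame(case_nr = 1:5,
                 met_any = c(1, 1, 2, 1, 1),
                 liver_mets_y_n = c(1, 2, 2, 2, 2),
                 regional_lymph_nodes_y_n = c(1, 1, 2, 1, 1),
                 pc = c(2, 2, 2, 1, 1))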

How to clean and re-code check-all-that-apply responses in R survey data?

I've got survey data with some multiple-response questions like this:
HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care
where multiple responses were entered with commas and are recorded as distinct factor levels, i.e.:
unique(HS18)
[1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999
This is as much a data-cleaning protocol question as an R question. I'm doing the cleaning but not the analysis, so everything needs to be transparent and user-friendly when I pass it back, and the PI doesn't use R. Basically I'd like to split the multiples into levels and re-name them while keeping them together as a single observation. I'm not sure how to do this, or even if it's the right approach.
How do you generally deal with this issue? Is there an elegant way to process this for analysis in STATA (simple descriptives, regressions, odds ratios)?
Thanks everyone!!!
My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)
Your sample data here looks like it includes data that is not legal: 0, 888, and 999 are not listed in the options. It's possible/likely that these include DK/NR responses, but I can't be certain. As such:
Your data cleaning should be taking care of these anomalies before this step of converting 0+ length lists into indicator variables.
My code below arbitrarily ignores this fact and you will lose data. This is obviously not "A Good Thing™" in the long run; more robust checks are warranted (and not difficult). (I've added an HS18.other column to indicate when something was lost.)
The code:
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
                 .Names = paste0('HS18.', lvls),
                 row.names = c(NA, -length(dat)), class = 'data.frame')
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)
## HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
The resulting data.frame can be cbind-ed (or even converted to a matrix) with whatever other data you have.
(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)
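Since the analysis will happen in Stata, one way to hand the cleaned data off is via the haven package (a sketch; survey_data and the file name are placeholders, not objects from the question):
library(haven)
# attach the indicator columns to the rest of the survey data
survey_clean <- cbind(survey_data, ret)
write_dta(survey_clean, "survey_clean.dta")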

Update dataframe column efficiently using some hashmap method in R

I am new to R and can't figure out what I might be doing wrong in the code below and how I could speed it up.
I have a dataset and would like to add a column containing average value calculated from two column of data. Please take a look at the code below (WARNING: it could take some time to read my question but the code runs fine in R):
first let me define a dataset df (again I apologize for the long description of the code)
> df<-data.frame(prediction=sample(c(0,1),10,TRUE),subject=sample(c("car","dog","man","tree","book"),10,TRUE))
> df
prediction subject
1 0 man
2 1 dog
3 0 man
4 1 tree
5 1 car
6 1 tree
7 1 dog
8 0 tree
9 1 tree
10 1 tree
Next I add the new column called subjectRate to df:
df$subjectRate <- with(df,ave(prediction,subject))
> df
prediction subject subjectRate
1 0 man 0.0
2 1 dog 1.0
3 0 man 0.0
4 1 tree 0.8
5 1 car 1.0
6 1 tree 0.8
7 1 dog 1.0
8 0 tree 0.8
9 1 tree 0.8
10 1 tree 0.8
From the new table I generate a rateMap, so that new data can automatically have its subjectRate column filled in with the previously obtained averages.
rateMap <- df[!duplicated(df[, c("subjectRate")]), c("subject","subjectRate")]
> rateMap
subject subjectRate
1 man 0.0
2 dog 1.0
4 tree 0.8
Now I define a new dataset with a combination of the old subjects in df and new subjects:
> dfNew<-data.frame(prediction=sample(c(0,1),15,TRUE),subject=sample(c("car","dog","man","cat","book","computer"),15,TRUE))
> dfNew
prediction subject
1 1 man
2 0 cat
3 1 computer
4 0 dog
5 0 book
6 1 cat
7 1 car
8 0 book
9 0 computer
10 1 dog
11 0 cat
12 0 book
13 1 dog
14 1 man
15 1 dog
My question: how do I create the third column efficiently? Currently I am running the test below, where I look up the subject rate in the map and input the value if found, or 0.5 if not.
> all_facts<-levels(factor(rateMap$subject))
> dfNew$subjectRate <- sapply(dfNew$subject,function(t) ifelse(t %in% all_facts,rateMap[as.character(rateMap$subject) == as.character(t),][1,"subjectRate"],0.5))
> dfNew
prediction subject subjectRate
1 1 man 0.0
2 0 cat 0.5
3 1 computer 0.5
4 0 dog 1.0
5 0 book 0.5
6 1 cat 0.5
7 1 car 0.5
8 0 book 0.5
9 0 computer 0.5
10 1 dog 1.0
11 0 cat 0.5
12 0 book 0.5
13 1 dog 1.0
14 1 man 0.0
15 1 dog 1.0
But with a real dataset (more than 200,000 rows) and multiple columns similar to subject over which to compute averages, the code takes a very long time to run. Can somebody suggest a better way to do what I am trying to achieve? Maybe some merge or something, but I am out of ideas.
Thank you.
I suspect (but am not sure, since I haven't tested it) that this will be faster:
dfNew$subjectRate <- rateMap$subjectRate[match(dfNew$subject,rateMap$subject)]
since it mostly uses just indexing and match. It's certainly a bit simpler, I think. This will fill in the "new" values with NAs rather than 0.5, which can then be filled in however you like with:
dfNew$subjectRate[is.na(dfNew$subjectRate)] <- newValue
If the ave piece is particularly slow, the standard recommendation these days is to use the data.table package:
require(data.table)
dft <- as.data.table(df)
setkeyv(dft, "subject")
dft[, subjectRate := mean(prediction), by = subject]
and this will probably attract a few comments suggesting ways to eke a bit more speed out of that data.table aggregation in the last line. Indeed, merging or joining using pure data.tables may be even slicker (and fast), so you might want to investigate that option as well; a sketch follows. (See the very bottom of ?data.table for a bunch of examples.)
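For instance, a sketch of that join-based option, reusing the rateMap and dfNew objects from the question (the update-join syntax assumes a reasonably recent data.table):
library(data.table)
rate_dt <- as.data.table(rateMap)
new_dt <- as.data.table(dfNew)
# update join: pull subjectRate from the lookup table, matching on subject
new_dt[rate_dt, subjectRate := i.subjectRate, on = "subject"]
# subjects missing from the lookup are left NA; fill them with the 0.5 default
new_dt[is.na(subjectRate), subjectRate := 0.5]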
