R: Adding a column of a conditional observation count [duplicate] - r

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 3 years ago.
I am looking to add a column to my data that holds a running count for each observation. I have data on NBA teams and each of their games, listed by date, and I want to create a column that shows, for each team, which game of the season each row represents.
My data looks like this:
# gmDate teamAbbr opptAbbr id
# 2012-10-30 WAS CLE 2012-10-30WAS
# 2012-10-30 CLE WAS 2012-10-30CLE
# 2012-10-30 BOS MIA 2012-10-30BOS
Commas separate each column in the actual file.
I've tried to use add_count, but this gave me the total number of games each team has played in the dataset.
Prior attempts:
nba_box %>% add_count()
I expected the added column to display the game number for each team (1-82), but instead it shows the total number of games in the dataset (82).

Here is a base R example that approaches the problem with a for loop. Given that a team can appear in either column, we keep track of each team's running count by unlisting the data and using the table function to sum the previous rows.
# initialize some fake data
test <- as.data.frame(t(replicate(6, sample(LETTERS[1:3], 2))),
                      stringsAsFactors = F)
colnames(test) <- c("team1", "team2")
# initialize two new columns
test$team2_gamenum <- test$team1_gamenum <- NA
count <- NULL
for(i in 1:nrow(test)){
  out <- c(count, table(unlist(test[i, c("team1", "team2")])))
  count <- table(rep(names(out), out)) # probably not the optimal way of combining two table results
  test$team1_gamenum[i] <- count[which(names(count) == test[i, 1])]
  test$team2_gamenum[i] <- count[which(names(count) == test[i, 2])]
}
test
test
# team1 team2 team1_gamenum team2_gamenum
#1 B A 1 1
#2 A C 2 1
#3 C B 2 2
#4 C B 3 3
#5 A C 3 4
#6 A C 4 5
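For the long, date-sorted format shown in the question itself (one row per team per game), the same numbering can be had more directly in base R with ave. The data frame below is a small mockup of the question's columns, not the real dataset:

```r
# mockup of the question's long format (one row per team per game, sorted by date)
nba_box <- data.frame(
  gmDate   = c("2012-10-30", "2012-10-30", "2012-10-30", "2012-10-31"),
  teamAbbr = c("WAS", "CLE", "BOS", "WAS"),
  opptAbbr = c("CLE", "WAS", "MIA", "BOS"),
  stringsAsFactors = FALSE
)

# running game number within each team: 1, 2, ... up to 82 over a full season
nba_box$game_num <- ave(seq_len(nrow(nba_box)), nba_box$teamAbbr, FUN = seq_along)
nba_box
```

Because each row already belongs to exactly one team, there is no need to track both columns the way the wide-format loop above does.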


Changing elements in one data frame [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 1 year ago.
I have a df with two columns where the elements in them are codes:
> head(listaNombres)
ocupacion1 ocupacion2
1 11-2020 11-9190
2 11-2020 41-1010
3 11-2020 41-2030
4 11-2020 41-3090
5 11-2020 41-4010
6 11-3030 11-9190
And then a separate df with the meaning for each code:
> head(descripcion)
# A tibble: 6 x 2
broadGroup Desc
<chr> <chr>
1 11-1010 Chief Executives
2 11-1020 General and Operations Managers
3 11-1030 Legislators
4 11-2010 Advertising and Promotions Managers
5 11-2020 Marketing and Sales Managers
6 11-2030 Public Relations and Fundraising Managers
How can I replace the codes in the first df with the corresponding Desc values from the second?
This question has been answered a few times, but none of the answers seems to be simple and also use base R. I'm not a fan of making people use unnecessary packages, so I'll write this up, since there is an easy and straightforward solution that requires no extra packages.
Using the match function we can recode both columns:
oldvalues <- descripcion$broadGroup
# sets up values we wish to change from
newvalues <- descripcion$Desc
# sets up the values we want to change to
listaNombres$ocupacion1 = newvalues[ match(listaNombres$ocupacion1, oldvalues) ]
# Overwrite current ocupacion1 values with desired recode
listaNombres$ocupacion2 = newvalues[ match(listaNombres$ocupacion2, oldvalues) ]
# Overwrite current ocupacion2 values with desired recode
Say we have
v3$recode = v2[ match(v3$recode, v1) ]
What this does is take our three vectors, v1, v2 and v3: match(v3, v1) returns a vector of positions in v1 where the first match for each element of v3 occurs. We then select elements from v2 using this vector of positions, which gives us the recoded version of v3$recode. We then feed this recoded vector of values straight back into v3$recode, overwriting the old values.
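A tiny self-contained illustration of that mechanism (the vectors here are made up):

```r
v1 <- c("a", "b", "c")          # old values (to change from)
v2 <- c("one", "two", "three")  # new values (to change to)
v3 <- c("c", "a", "c")          # vector to recode

match(v3, v1)      # positions of v3's elements within v1: 3 1 3
v2[match(v3, v1)]  # the recoded vector: "three" "one" "three"
```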
Edit: I've since checked this in R, and the solution works on the following mockup dataset:
ocupacion1 = c(1,2,3,4)
ocupacion2 = c(3,4,4,2)
listaNombres = data.frame(ocupacion1,ocupacion2)
broadGroup = c(1,2,3,4)
Desc = c("one","two","three","four")
descripcion = data.frame(broadGroup,Desc)
combining everything gives the following
> ocupacion1 = c(1,2,3,4)
> ocupacion2 = c(3,4,4,2)
> listaNombres = data.frame(ocupacion1,ocupacion2)
> head(listaNombres)
ocupacion1 ocupacion2
1 1 3
2 2 4
3 3 4
4 4 2
> broadGroup = c(1,2,3,4)
> Desc = c("one","two","three","four")
> descripcion = data.frame(broadGroup,Desc)
> head(descripcion)
broadGroup Desc
1 1 one
2 2 two
3 3 three
4 4 four
> oldvalues <- descripcion$broadGroup
> newvalues <- descripcion$Desc
> listaNombres$ocupacion1 = newvalues[ match(listaNombres$ocupacion1, oldvalues) ]
> listaNombres$ocupacion2 = newvalues[ match(listaNombres$ocupacion2, oldvalues) ]
> # Overwrite current ocupacion2 values with desired recode
> head(listaNombres)
ocupacion1 ocupacion2
1 one three
2 two four
3 three four
4 four two

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (it's a lot of trial and error, but that's ok).
Now I need to write a more sophisticated piece of code and I can't find a way to do it.
Here's the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140,000 entries) where the 'ID' variable is the same.
As this is the third day I have been trying to solve this, I have already found and tested for and apply, but both run for hours (I canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i, 4]
  SUM_Entries <- sum(Sub03$Source == Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i, 13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i, 13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
               stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is character at the moment, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
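As a sketch of that join-then-compute approach, using the column names from the question's example data (Source, ID, VAL1, VAL2, VAL9); the two small data frames below are mockups, not the real sub-data-sets:

```r
library(dplyr)

# mockup of the two sub-data-sets; column names follow the question's example
Sub02 <- data.frame(ID = c("id1", "id2"), VAL9 = c(8, 9),
                    stringsAsFactors = FALSE)
Sub03 <- data.frame(Source = c("id1", "id1", "id2"),
                    VAL1 = c(4, 6, 5), VAL2 = c(1, 7, 8),
                    stringsAsFactors = FALSE)

# one row per ID: count of matching entries plus the sums of VAL1 and VAL2
sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(add = n() + sum(VAL1) + sum(VAL2))

# join on the shared ID and update VAL9 in one vectorised step
Sub02 <- Sub02 %>%
  left_join(sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = VAL9 + coalesce(add, 0)) %>%   # coalesce handles IDs with no match
  select(-add)
Sub02
```

This replaces the row-by-row loop with one grouped summary and one join, which is where the speedup over the 50,000-iteration for loop comes from.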

Select random rows from duplicate IDS

I'm dealing with a dataset where I have students ratings of teachers. Some students rated the same teacher more than once.
What I would like to do with the data is to subset it with the following criteria:
1) Keep any unique student Ids and ratings
2) In cases where students rated a teacher twice, keep only one rating, chosen at random.
3) If possible I'd like to be able to run the code in a munging script at the top of every analysis file and ensure that the dataset created is exactly the same for each analysis (set seed?).
# data
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id,teacher.id,rating)
Thanks for any guidance for how to move forward.
Assuming that each student.id is only applied to one teacher, you could use the following method.
# get a list containing data.frames for each student
myList <- split(df, df$student.id)
# take a sample of each data.frame if more than one observation or the single observation
# bind the result together into a data.frame
set.seed(1234)
do.call(rbind, lapply(myList, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
This returns
student.id teacher.id rating
1 1 1 100
2 2 1 89
3 3 1 99
4 4 2 87
5 5 2 24
6 6 2 52
7 7 2 99
8 8 2 79
9 9 2 12
If the same student.id rates multiple teachers, then this method requires the construction of a new variable with the interaction function:
# create new interaction variable
df$stud.teach <- interaction(df$student.id, df$teacher.id)
myList <- split(df, df$stud.teach)
then the remainder of the code is identical to that above.
A potentially faster method is to use the data.table library and rbindlist.
library(data.table)
# convert into a data.table
setDT(df)
myList <- split(df, df$stud.teach)
# put together data.frame with rbindlist
rbindlist(lapply(myList, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
This can now be done much faster using data.table. Your question is equivalent to sampling rows from within groups, see
Sample random rows within each group in a data.table
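A sketch of that per-group sampling with the question's data, taking one random row per student-teacher pair:

```r
library(data.table)
set.seed(1234)  # fixed seed so the munging script reproduces the same subset

student.id <- c(1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9)
teacher.id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
rating <- c(100, 99, 89, 100, 99, 87, 24, 52, 100, 99, 89, 79, 12)
dt <- data.table(student.id, teacher.id, rating)

# keep one randomly chosen row per (student, teacher) group;
# groups with a single row are returned as-is, since sample(1, 1) is 1
picked <- dt[, .SD[sample(.N, 1)], by = .(student.id, teacher.id)]
picked
```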

rowSums using an indirect variable (i.e. using a string variable to allocate the column numbers)

I'm still pretty much a newbie in R, but I'm enjoying the journey so far. I'm trying to group weekly columns together into quarters, and I want a more elegant solution than assigning the values line by line.
So I have created a set of variables to hold the column ranges, e.g. Q1 <- 5:9, Q2 <- 10:22, and so forth. After reading the original data frame, I want to create a new one that has Q1 as a variable containing the total of columns 5-9, Q2 with the total of 10:22, etc. The problem is, rowSums doesn't like me using a variable to denote the actual range.
This is what I am trying to achieve, with sval containing the original weekly data, and qsval, containing the quarterly totals:
Q110 <- 5:9
Q210 <- 10:22
Q310 <- 23:35
Q410 <- 36:48
Q111 <- 49:61
Q211 <- 62:74
Q311 <- 75:87
Q411 <- 88:100
qsval <- sval[,c(1:4)] # Copying the first four columns from the weekly data
period <- c('Q110','Q210','Q310','Q410','Q111','Q211','Q311','Q411')
for (i in 1:8) {
  assign(qsval$period[i], rowSums(sval, na.rm=F, get(period[i])))
}
Is this possible at all? The error message given is:
Error in rowSums(sval, na.rm = F, get(period[i])) : invalid 'dims'
Any advice would be much appreciated! Thank you.
In the absence of reproducible data, here's an example which hopefully you can adapt to your specific case:
set.seed(1) # just to make the random data reproducible
sval <- data.frame(replicate(6,sample(1:3)))
# X1 X2 X3 X4 X5 X6
#1 1 3 3 1 3 2
#2 3 1 2 3 1 3
#3 2 2 1 2 2 1
Qlist <- list(Q1=1:3,Q2=4:6)
qsval <- data.frame(lapply(Qlist, function(x) rowSums(sval[x]) ))
# Q1 Q2
#1 7 6
#2 6 7
#3 5 5

count frequency of rows based on a column value in R

I understand that this is quite a simple question, but I haven't been able to find an answer to this.
I have a data frame which gives the id of a person and his hobby. Since a person may have many hobbies, the id field may be repeated across multiple rows, each with a different hobby. I have been trying to print out only those rows whose id has more than one hobby. I was able to get the frequencies using table.
But how do I apply the condition to print only when the frequency is greater than one?
Secondly, is there a better way to find frequencies without using table?
This is my attempt with table without the filter for frequency greater than one
> id=c(1,2,2,3,2,4,3,1)
> hobby = c('play','swim','play','movies','golf','basketball','playstation','gameboy')
> df = data.frame(id, hobby)
> table(df$id)
1 2 3 4
2 3 2 1
Try using data.table; I find it more readable than the table() function:
library(data.table)
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies',
'golf','basketball','playstation','gameboy')
df = data.frame(id=id, hobby=hobby)
dt = as.data.table(df)
dt[,hobbies:=.N, by=id]
You will get, for your condition:
> dt[hobbies >1,]
id hobby hobbies
1: 1 play 2
2: 2 swim 3
3: 2 play 3
4: 3 movies 2
5: 2 golf 3
6: 3 playstation 2
7: 1 gameboy 2
This example assumes you are trying to filter df:
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies','golf','basketball',
'playstation','gameboy')
df = data.frame(id, hobby)
table(df$id)
Get all those ids that have more than one hobby
tmp <- as.data.frame(table(df$id))
tmp <- tmp[tmp$Freq > 1,]
Using that information - select their IDs in df
df1 <- df[df$id %in% tmp$Var1,]
df1
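On the second question (frequencies without table), a base R alternative is ave, which attaches the per-id count directly as a column, so no merge back from a frequency table is needed:

```r
id <- c(1, 2, 2, 3, 2, 4, 3, 1)
hobby <- c('play', 'swim', 'play', 'movies', 'golf', 'basketball',
           'playstation', 'gameboy')
df <- data.frame(id, hobby)

# per-id frequency, computed without table()
df$n <- ave(df$id, df$id, FUN = length)

# rows whose id occurs more than once
df[df$n > 1, ]
```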
