I have been looking at many solutions on this site to similar problems for weeks but cannot wrap my head around how to apply them successfully to this particular one:
I have the dataset at https://statdata.pgatour.com/r/006/player_stats.json
using:
player_stats_url<-"https://statdata.pgatour.com/r/006/player_stats.json"
player_stats_json <- fromJSON(player_stats_url)
player_stats_df <- ldply(player_stats_json,data.frame)
gives:
a dataframe of 145 rows (one for each player) and 7 columns, the 7th of which is named "players.stats" and contains the data I'd like broken out into a 2-dimensional dataframe
next, I do this to take a closer look at the "players.stats" column:
player_stats_df2<- ldply(player_stats_df$players.stats, data.frame)
the data in the "players.stats" column are formatted as follows: rows of 25 repeating stat categories in the column player_stats_df2$name, plus another nested list in the column $rounds ... on which I repeat ldply to unnest everything, but I cannot sew it back together logically in the way that I want ...
the format of the column $rounds, after unnesting it using:
player_stats_df3<- ldply(player_stats_df2$rounds, data.frame)
gives the round number in the first column $r (1, 2, 3, or 4 as the only choices) and the stat value in the second column $rValue. To complicate things, some entries have 2 rounds while others have 4 rounds
the final 2-dimensional dataframe I need would have the columns players.pid and players.pn from player_stats_df, a NEW COLUMN named "round.no" corresponding to player_stats_df3$r, and then one column for each of the 25 repeating stat categories from player_stats_df2$name (eagles, birdies, pars ... SG: Off-the-tee, SG: tee-to-green, SG: Total), with each row unique to a player name and round number ...
For example, there would be four rows for Matt Kuchar, one for each round played, and a column for each of the 25 stat categories ... However, some other players would only have 2 rows.
Please let me know if I can clarify this at all for this particular example; I have tried many things but cannot sew this data back together into the format I need to use it in ...
Here is something you can start with: we can create a tibble using tibble::as_tibble, then apply multiple unnest calls using tidyr::unnest.
library(tidyverse)
as_tibble(player_stats_json$tournament$players) %>% unnest() %>% unnest(rounds)
Also see this tutorial. Finally, use dplyr ("tidyverse") instead of plyr.
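To get all the way to the wide format described in the question, one possible continuation is sketched below (untested; I'm assuming the unnested columns end up named pid, pn, name, r and rValue, matching the pieces described above, so adjust to whatever names the unnest actually produces):

library(tidyverse)

stats_long <- as_tibble(player_stats_json$tournament$players) %>%
  unnest(stats) %>%   # assumed name of the list-column holding the 25 stats
  unnest(rounds)      # should yield columns r (round number) and rValue

stats_wide <- stats_long %>%
  rename(round.no = r) %>%
  select(pid, pn, round.no, name, rValue) %>%
  pivot_wider(names_from = name, values_from = rValue)

stats_wide should then have one row per player and round, with one column per stat category.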
I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates (but those dates are not in order and fall in different weeks).
My data looks like this (of course it's not only the 2nd of March):
I want to average the data for each of those 4 days, so I can compare e.g. the "Nb Nodes" from day 1 to day 4.
The goal is to end up with a jitter plot containing the group, the investigated data point and the date.
I'm a medical student, so I don't really have any knowledge about this kind of thing yet, but I'm trying to learn. Hopefully I provided enough info!
Found the solution:
#Group by date and group
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
#Summarizing the mean of `Nb meshes` in every Group and Date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the below code will work.
1. group_by() the dimension you want to summarize by.
2a. across() is a helper verb so that you don't need to type each column manually; it lets us use tidyselect language to quickly reference columns whose names contain "Nb" (a pattern I noticed from your screenshot).
2b. In the second argument of across(), you supply the function (or list of functions) you want to apply to each column selected by the first argument.
2c. The optional .names argument of across() gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
#df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
#if you just want a single column then do this
df %>% group_by(Exp.Date) %>%
summarize(mean_nb_nodes=mean(`Nb nodes`))
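For the jitter plot mentioned in the question, something along these lines might work (a sketch, assuming the data frame is called df and has columns Exp.Date, Group and `Nb nodes`; adjust to your real column names):

library(ggplot2)

ggplot(df, aes(x = factor(Exp.Date), y = `Nb nodes`, colour = Group)) +
  geom_jitter(width = 0.2) +
  labs(x = "Experiment date", y = "Nb nodes")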
I have a tibble with 1755 rows.
The other question on my profile relates to setting this up.
The columns include a variable number of columns with names in the format "C1L", "C1H", "C2L", etc., always starting with C (no other columns start with C), and a column named "DI". I would like to nest these columns.
I run this code:
fullfile <- fullfile %>%
nest(alleles = c(starts_with("C", ignore.case = FALSE), "DI"))
and get an output tibble with 1742 rows.
Looking in more detail, a subset of rows have two sets of data in the "alleles" column.
The affected rows are spread through the dataset, not clustered.
This is data from 16 groups, and each group has a probability associated with each row. This gives me an easy measure: summing the probability column before the nest gives 16, while afterwards it gives 15.99826, so I'm definitely losing data, not just empty rows.
I'm looking for advice on what I can do to narrow down the cause of this issue.
I can't upload the example as I don't have permission to share the data I'm afraid.
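One way to narrow it down: nest() collapses rows that share identical values in all of the columns that are not being nested, so any exact duplicates among those outer columns get merged into a single row with two (or more) sets of data in "alleles". Counting those duplicates should account for the 13 missing rows. A sketch, assuming fullfile is the pre-nest tibble:

library(dplyr)

# keep only the columns that stay outside the nest
outer_cols <- fullfile %>%
  select(-starts_with("C", ignore.case = FALSE), -DI)

# rows whose outer columns exactly duplicate another row
outer_cols %>%
  filter(duplicated(outer_cols) | duplicated(outer_cols, fromLast = TRUE))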
I have a data frame consisting of monthly volumes beginning 2004-01-01 and ending 2019-12-01. I need to apply a filter that will delete rows that equal certain dates. My problem is that there are 28 dates I need to filter out and they aren't consecutive. What I have right now works, but isn't efficient. I am using dplyr's filter function.
I currently have 28 variables, d1-d28, which are the dates that I would like filtered out and then I use
df<-data%>%dplyr::filter(Date!=d1 & Date!=d2 & Date!=d3 .......Date!=d28)
I would like to put the dates of interest, the d1-d28, into a data.frame and just reference the data.frame in my filter code.
I've tried:
df<-data%>%dplyr::filter(!Date %in% DateFilter)
where DateFilter is a data.frame with 1 column and 28 rows of the dates I want filtered out, but I get an error saying the lengths of the objects don't match.
Is there any way I can do this with dplyr?
Here, we may use filter_at
library(dplyr)
data %>%
filter_at(vars(matches('^d\\d+$')), all_vars(Date != .))
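As a side note on the %in% attempt from the question: it likely fails because DateFilter is a data frame rather than a vector. Pulling the single column out first should make that approach work too (a sketch, assuming the column holding the dates is named Date; adjust to its real name):

library(dplyr)

df <- data %>%
  filter(!Date %in% DateFilter$Date)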
I need to create a bunch of subset data frames out of a single big df, based on a date column (e.g. "Aug 2015" in month-year format). It should be something similar to the subset function, except that the number of subset dfs to be formed should change dynamically depending on the available values in the date column.
All the subset data frames need to have the same structure, and the date column value will be one and the same within each subset df.
Suppose my big df currently has the last 10 months of data; then I need 10 subset data frames now, and 11 dfs if I run the same command next month (with 11 months of base data).
I have tried something like the below, but after each iteration the subset subdf_i gets overwritten. Thus, I end up with only one subset df, containing only the last value of the month column.
I expected 45 subset dfs to be created, like subdf_1, subdf_2, ... and subdf_45, one for each of the 45 unique values of the month column.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
subdf_i <- subset(df, mnth == uniqmnth[i])
i==i+1
}
I hope there is some option in the subset function, or some looping approach that might do it. I am a beginner in R and not sure how to arrive at this.
I think the solution for this is to use assign(), so that the iterating variable i gets appended to the name of each of the 45 subsets. Thanks to my friend for the note. Here is the solution that avoids the subset data frame being overwritten on each run of the loop.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)) {
  # build the name "subdf_i" and assign the matching subset to it
  assign(paste("subdf_", i, sep = ""), subset(df, mnth == uniqmnth[i]))
}  # the for loop increments i by itself; no manual increment is needed
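A possible alternative (not part of the original answer): split() returns all of the subsets at once as a named list, which is often easier to manage than 45 separate objects in the workspace:

# one list element per unique month, named after that month
subdf_list <- split(df, df$mnth)

# e.g. the subset for one month (assuming "Aug 2015" exists in df$mnth)
subdf_list[["Aug 2015"]]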
I hope you won't find my question too silly; I did a lot of research, but it seems that I can't figure out how to solve this really annoying issue.
I have data for 6 participants (P) in an experiment, with 50 trials (T) per participant and 10 conditions (C). I'd like to create a data frame in R to hold these data.
This data.frame should have 3 factors (P, T and C) and so a total number of rows of P*T*C. The difficulty for me is creating it, since I have the data for the 6 participants in 6 data.frames of 100 obs (T) by 10 variables (C).
I'd first like to create the empty dataset with these factors, and then copy in the values of the 6 datasets according to the factors P, T and C.
Any help would be greatly appreciated; I'm a novice in R.
Thank you.
OK. First we create one big data frame for all participants:
result<-rbind(dfrforparticipant1, dfrforparticipant2,...dfrforparticipant6) #you'll have to fill out the proper names of the original data.frames
Next, we add a column for the participant ID:
numTrials<-50 #although 100 is also mentioned in your question
result$P<-as.factor(rep(1:6, each=numTrials))
Finally, we need to go from 'wide' format to 'long' format. I'm assuming the column names holding the results for each condition are called C1, C2, etc., and that your original data.frames already held a column named T to denote the trial. Like this (untested, since you did not provide example data):
orgcolnames<-paste("C", 1:10, sep="")
result2 <- reshape(result, varying = list(orgcolnames), v.names = "val",
                   idvar = c("T", "P"), timevar = "C",
                   times = seq_along(orgcolnames), direction = "long")
What you want is now in result2.
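A roughly equivalent sketch using tidyr's pivot_longer, under the same assumptions (condition columns named C1 to C10, plus existing T and P columns):

library(tidyr)

result2 <- pivot_longer(result,
                        cols = starts_with("C"),
                        names_to = "C",
                        names_prefix = "C",
                        values_to = "val")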