My first question here...
I have 2 dataframes, both with a different number of rows.
The first one has 3 columns, the second one has 1 column.
I want every combination of the values in the 1st column of the first dataframe with the values in the only column of the second dataframe, then the same for the 2nd column of the first dataframe, and so on for each column.
I assume the result will be a one-column dataframe (?).
Something like this:
Attempts with combn did not help me yet...
Thanks!
This is probably not exactly what you want, but it provides a starting point. It assumes your first dataframe is called df and the other one (with one column) is called df2.
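# Reshape df to long format with tidyr, so all values sit in one column
df_long <- tidyr::pivot_longer(df, cols = c("loc1", "loc2", "loc3"))
# Cartesian join (data.table::CJ) of those values with the single column of df2
data.table::CJ(df_long$value, df2[[1]])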
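For readers who find the two steps easier to follow outside R, here is a minimal standard-library Python sketch of the same idea: stack every column of the first table into one list, then pair each stacked value with each value of the second table. The data, the column names loc1 to loc3, and the name codes are made up for illustration, since the question shows no data.

```python
from itertools import product

# Hypothetical stand-ins for the two data frames (the question shows no data).
df = {"loc1": ["a", "b"], "loc2": ["c", "d"], "loc3": ["e", "f"]}
codes = ["x", "y", "z"]  # the single column of the second table

# "Pivot longer": stack the values of every column of df into one list.
long_values = [value for column in df.values() for value in column]

# Cartesian product: every stacked value paired with every code.
combinations = list(product(long_values, codes))

print(len(combinations))  # 6 stacked values x 3 codes
print(combinations[:3])
```

The result is a list of pairs rather than a one-column table; joining each pair into a single string would give the one-column shape the question guesses at.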
I have two columns in a dataframe that contain date information after a left outer join. Because of the style of join, one of the date columns now contains NAs. I want to check if all non-NA values are identical between these columns. An example is below:
date 1 date 2
1/1/21 NA
1/2/21 1/2/21
1/3/21 NA
1/4/21 1/4/21
I don't need the second column if all non-NA values match.
Before I did the left outer join, I did an outer join and ran this statement:
identical(df[['date 1']], df[['date 2']])
It returned TRUE, as every row was indeed identical in both columns.
Is there a way to use this or a similar statement while ignoring all rows that contain an "NA" in "date 2"?
In pandas, you can filter your df for rows where date 2 is not null and differs from date 1, then check whether any such rows exist.
# Rows where 'date 2' is present but differs from 'date 1'
df_mismatch = df[(df['date 2'].notnull()) & (df['date 1'] != df['date 2'])]
if len(df_mismatch) > 0:
    print('found this many mismatches:', len(df_mismatch))
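The logic of that filter can also be shown without any library. The sketch below uses the four sample rows from the question, with None standing in for NA; the variable names are chosen here for illustration only.

```python
# Sample rows from the question; None stands in for NA.
date_1 = ["1/1/21", "1/2/21", "1/3/21", "1/4/21"]
date_2 = [None, "1/2/21", None, "1/4/21"]

# Keep only the rows where date 2 is present but differs from date 1.
mismatches = [
    (row, d1, d2)
    for row, (d1, d2) in enumerate(zip(date_1, date_2))
    if d2 is not None and d1 != d2
]

# True when every non-missing date 2 value equals date 1 in the same row.
all_match = not mismatches
print(all_match)
```

If all_match is true, the second column carries no extra information and can be dropped.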
I found a workaround.
First, create a new dataframe that holds just these two columns. A separate dataframe is needed because na.omit, used in the next step, removes every row that contains an NA in any column.
df2 <- df[, c("date 1", "date 2")]
Then remove all rows that contain an NA in any column:
df2 <- na.omit(df2)
Finally, run identical to check whether the remaining columns match:
identical(df2[['date 1']], df2[['date 2']])
I'm sure there is a more elegant way, but this worked for me in the meantime.
I have a dataframe called Tgame containing two columns, game and hours_played. I am trying to remove duplicates in the game column and compute the average of hours_played for each game.
Should be as simple as this (using data.table):
library(data.table)
setDT(Tgame)[, mean(hours_played), by = game]
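As an illustration of what the grouped mean does, here is a small standard-library Python sketch. The rows are invented, because the question does not show the contents of Tgame.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (game, hours_played) rows; the question shows no data.
tgame = [("Dota 2", 10.0), ("Portal", 2.0), ("Dota 2", 30.0), ("Portal", 4.0)]

# Collect the hours for each game, which also collapses duplicate game names.
hours_by_game = defaultdict(list)
for game, hours in tgame:
    hours_by_game[game].append(hours)

# One average per distinct game.
mean_hours = {game: mean(hours) for game, hours in hours_by_game.items()}
print(mean_hours)
```

Each game appears once in the result, so the deduplication and the averaging happen in the same step.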
I have been looking at many solutions on this site to similar problems for weeks but cannot wrap my head around how to apply them successfully to this particular one:
I have the dataset at https://statdata.pgatour.com/r/006/player_stats.json
using:
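library(jsonlite)  # assumed source of fromJSON(); the original post does not say
library(plyr)      # provides ldply()
player_stats_url <- "https://statdata.pgatour.com/r/006/player_stats.json"
player_stats_json <- fromJSON(player_stats_url)
player_stats_df <- ldply(player_stats_json, data.frame)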
gives:
a dataframe of 145 rows, one for each player, and 7 columns, the 7th of which is named "players.stats" that contains the data I'd like broken out into a 2-dimensional dataframe
next, I do this to take a closer look at the "players.stats" column:
player_stats_df2<- ldply(player_stats_df$players.stats, data.frame)
The data in the "players.stats" column is formatted as follows: rows of 25 repeating stat categories in the column player_stats_df2$name, and another nested list in the column $rounds. I repeat ldply on that to unnest everything, but I cannot sew it back together logically in the way that I want.
the format of the column $rounds, after unnested, using:
player_stats_df3<- ldply(player_stats_df2$rounds, data.frame)
This gives the round number in the first column $r (1, 2, 3 or 4 are the only choices) and the stat value in the second column $rValue. To complicate things, some entries have 2 rounds, while others have 4 rounds.
the final format of the 2-dimensional dataframe I need would have columns named players.pid and players.pn from player_stats_df, a NEW COLUMN denoting "round.no" which would correspond to player_stats_df3$r and then each of the 25 repeating stat categories from player_stats_df2$name as a column (eagles, birdies, pars ... SG: Off-the-tee, SG: tee-to-green, SG: Total) and each row being unique to a player name and round number ...
For example, there would be four rows for Matt Kuchar, one for each round played, and a column for each of the 25 stat categories ... However, some other players would only have 2 rows.
Please let me know if I can clarify this at all for this particular example- I have tried many things but cannot sew this data back together in the format I need to use it in ...
Here is something you can start with: create a tibble using tibble::as_tibble, then apply tidyr::unnest repeatedly.
library(tidyverse)
as_tibble(player_stats_json$tournament$players) %>% unnest() %>% unnest(rounds)
Also see this tutorial here. Finally, use dplyr (part of the tidyverse) instead of plyr.
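Unnesting alone still leaves one row per player, stat and round. To show the remaining reshaping step (one row per player and round, one column per stat), here is a rough standard-library Python sketch. The field names pid, pn, stats, name, rounds, r and rValue follow the column names mentioned in the question; the players and values themselves are invented, and the real JSON may be structured differently.

```python
# Hypothetical, heavily trimmed stand-in for the "players" part of the JSON.
players = [
    {"pid": "1", "pn": "Player A", "stats": [
        {"name": "Eagles", "rounds": [{"r": "1", "rValue": "0"},
                                      {"r": "2", "rValue": "1"}]},
        {"name": "Birdies", "rounds": [{"r": "1", "rValue": "4"},
                                       {"r": "2", "rValue": "3"}]},
    ]},
    {"pid": "2", "pn": "Player B", "stats": [
        {"name": "Eagles", "rounds": [{"r": "1", "rValue": "1"}]},
        {"name": "Birdies", "rounds": [{"r": "1", "rValue": "2"}]},
    ]},
]

# Walk player -> stat -> round and file each value under (player, round).
wide = {}
for player in players:
    for stat in player["stats"]:
        for rnd in stat["rounds"]:
            key = (player["pid"], player["pn"], rnd["r"])
            wide.setdefault(key, {})[stat["name"]] = rnd["rValue"]

# One row per player and round, with one column per stat name.
rows = [
    {"pid": pid, "pn": pn, "round.no": r, **stats}
    for (pid, pn, r), stats in wide.items()
]
for row in rows:
    print(row)
```

Players with fewer rounds simply produce fewer rows, which matches the requirement that some players have 2 rows and others 4.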
I need help to find the best way to convert the table below using the conditions:
If...
the data of the 1st column (plot number) and
the data of the 2nd column (subplot number) and
the data of the 3rd column (trees) and
the name of the tree in the 4th column (tree_species) and
the data of the 5th column (stems)
are the SAME in different rows, the new column dbh_equivalent will be the result of the function:
=SQRT(dbc_cm[row 1]^2 + dbc_cm[row 2]^2 + ... + dbc_cm[row n]^2)
that is, the square root of the sum of the squared dbc_cm values of the matching rows.
That is, in the table above the result would be:
Thanks
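Reading the formula as the square root of the sum of squared dbc_cm values within each group of matching rows, the calculation could be sketched as follows in standard-library Python. The rows are invented for illustration, since the tables from the question are not available, and the column order (plot, subplot, tree, tree_species, stem, dbc_cm) is an assumption based on the description.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical rows: (plot, subplot, tree, tree_species, stem, dbc_cm).
rows = [
    (1, 1, 1, "oak", 1, 3.0),
    (1, 1, 1, "oak", 1, 4.0),
    (1, 1, 2, "pine", 1, 10.0),
]

# Sum dbc_cm squared over rows that share the first five columns.
sum_of_squares = defaultdict(float)
for *key, dbc_cm in rows:
    sum_of_squares[tuple(key)] += dbc_cm ** 2

# dbh_equivalent is the square root of each group's sum of squares.
dbh_equivalent = {key: sqrt(total) for key, total in sum_of_squares.items()}
for key, value in dbh_equivalent.items():
    print(key, value)
```

A group with a single row keeps its own dbc_cm value, because the square root of one squared value is the value itself.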
I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names, e.g. plot(data$V1, data$V2), but in this case they do not. How do I access each column individually when they have no names?
Thanks
Why not give them sensible names?
names(data) <- c("This", "That", "Other")
plot(data$This,data$That)
That's a better solution than using column numbers: names are meaningful, and if your data later changes to have a different number of columns, position-based code may break in several places. Give your data the correct names, and as long as you always refer to data$This, your code will keep working.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it stems from matrix notation. E.g., a 4x3 matrix has 4 rows and 3 columns. Thus, when I want to select the 1st row of the third column, I could do something like matrix[1,3].
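The same row-then-column convention can be illustrated in plain Python, with the caveat that Python counts from 0 where R counts from 1. The 4x3 table below is made up for the example.

```python
# Hypothetical 4x3 table with no column names (4 rows, 3 columns).
data = [
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
]

# Transpose rows into columns so each column can be picked by position.
columns = [list(col) for col in zip(*data)]

first_column = columns[0]  # corresponds to data[, 1] in R
third_column = columns[2]  # corresponds to data[, 3] in R
row1_col3 = data[0][2]     # corresponds to data[1, 3] in R

print(first_column, third_column, row1_col3)
```

With the columns pulled out by position, the first one can be plotted against each of the others without ever needing a header.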