In every example I have seen so far for reading CSV files in R, the variables are in columns and the observations (individuals) are in rows. In an introductory statistics course I am taking, there is an example table where the (many) variables are in rows and the (few) observations are in columns. Is there a way to read such a table so that you get a data frame in the usual orientation?
Here is a solution that uses the tidyverse. First, we gather the data into narrow-format tidy data, then we spread it back to wide format, using the first column as the key for the gathered observations by excluding it from gather().
We'll demonstrate the technique with state level summary data from the U.S. Census Bureau.
I created a table of population data for four states, where the states (observations) are in columns, and the variables are listed in rows of the table.
To make the example reproducible, we entered the data into Excel and saved it as a comma separated values file, which we assign to a character string in R and read with read.csv(text = ...).
textFile <- "Variable,Georgia,California,Alaska,Alabama
population2018Estimate,10519475,39557045,737438,4887871
population2010EstimatedBase,9688709,37254523,710249,4780138
pctChange2010to2018,8.6,6.2,3.8,2.3
population2010Census,8676653,37253956,710231,4779736"
# load tidyverse libraries
library(tidyr)
library(dplyr)
# read the CSV text into a data frame
data <- read.csv(text = textFile)
# first gather to narrow format, then spread back to wide format
data %>%
  gather(state, value, -Variable) %>%
  spread(Variable, value)
...and the results:
state pctChange2010to2018 population2010Census
1 Alabama 2.3 4779736
2 Alaska 3.8 710231
3 California 6.2 37253956
4 Georgia 8.6 8676653
population2010EstimatedBase population2018Estimate
1 4780138 4887871
2 710249 737438
3 37254523 39557045
4 9688709 10519475
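As an aside, in tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same reshape with the newer verbs (the textFile string from above is repeated here so the snippet is self-contained):

```r
library(tidyr)
library(dplyr)

textFile <- "Variable,Georgia,California,Alaska,Alabama
population2018Estimate,10519475,39557045,737438,4887871
population2010EstimatedBase,9688709,37254523,710249,4780138
pctChange2010to2018,8.6,6.2,3.8,2.3
population2010Census,8676653,37253956,710231,4779736"

data <- read.csv(text = textFile)

# lengthen everything except Variable (one row per state/variable pair),
# then widen so each Variable becomes its own column
result <- data %>%
  pivot_longer(-Variable, names_to = "state") %>%
  pivot_wider(names_from = Variable, values_from = value)
```

The result is one row per state with the four population variables as columns, the same shape the gather()/spread() pipeline produces.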
I have been trying to group this data frame by county; however, when I used pivot_wider() for the 2 casualty types, it created 2 rows for every county. I have tried fixing this but am new to R. Any help would be appreciated.
I have been looking at many solutions on this site to similar problems for weeks but cannot wrap my head around how to apply them successfully to this particular one:
I have the dataset at https://statdata.pgatour.com/r/006/player_stats.json
using:
library(jsonlite)
library(plyr)
player_stats_url <- "https://statdata.pgatour.com/r/006/player_stats.json"
player_stats_json <- fromJSON(player_stats_url)
player_stats_df <- ldply(player_stats_json, data.frame)
gives:
a dataframe of 145 rows, one for each player, and 7 columns, the 7th of which is named "players.stats" that contains the data I'd like broken out into a 2-dimensional dataframe
next, I do this to take a closer look at the "players.stats" column:
player_stats_df2<- ldply(player_stats_df$players.stats, data.frame)
the data in the "players.stats" column is formatted as follows: rows of 25 repeating stat categories in the column (player_stats_df2$name), and another nested list in the column $rounds, on which I repeat ldply() to unnest everything; but I cannot sew it back together logically in the way that I want ...
unnesting the column $rounds, using:
player_stats_df3<- ldply(player_stats_df2$rounds, data.frame)
gives the round number in the first column $r (1, 2, 3, 4 as the only choices) and then the stat value in the second column $rValue. To complicate things, some entries have 2 rounds, while others have 4 rounds.
the final format of the 2-dimensional dataframe I need would have columns named players.pid and players.pn from player_stats_df, a NEW COLUMN "round.no" corresponding to player_stats_df3$r, and then each of the 25 repeating stat categories from player_stats_df2$name as a column (eagles, birdies, pars ... SG: Off-the-tee, SG: tee-to-green, SG: Total), with each row unique to a player name and round number ...
For example, there would be four rows for Matt Kuchar, one for each round played, and a column for each of the 25 stat categories ... However, some other players would only have 2 rows.
Please let me know if I can clarify this at all for this particular example- I have tried many things but cannot sew this data back together in the format I need to use it in ...
Here is something you can start with: we create a tibble using tibble::as_tibble, then apply multiple unnests using tidyr::unnest.
library(tidyverse)
as_tibble(player_stats_json$tournament$players) %>% unnest() %>% unnest(rounds)
Also see this tutorial here. Finally, use dplyr/tidyverse instead of plyr.
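To make the unnest-then-widen idea concrete without hitting the live JSON feed, here is a sketch on toy data that mimics the nesting described in the question (pid/pn, a stats list-column with name, and a nested rounds list with r and rValue; all values below are hypothetical):

```r
library(tidyr)
library(dplyr)

# toy stand-in for the parsed JSON: two players, two stat categories,
# one player with two rounds and one with a single round
players <- tibble(
  pid = c("1", "2"),
  pn  = c("Matt Kuchar", "Other Player"),
  stats = list(
    tibble(name = c("Birdies", "Pars"),
           rounds = list(tibble(r = c(1, 2), rValue = c(4, 3)),
                         tibble(r = c(1, 2), rValue = c(10, 12)))),
    tibble(name = c("Birdies", "Pars"),
           rounds = list(tibble(r = 1, rValue = 5),
                         tibble(r = 1, rValue = 11)))
  )
)

# unnest both levels, then pivot the stat names into columns so each
# row is unique to a player and round number
res <- players %>%
  unnest(stats) %>%
  unnest(rounds) %>%
  pivot_wider(names_from = name, values_from = rValue) %>%
  rename(round.no = r)
```

Players with fewer rounds simply contribute fewer rows, matching the "four rows for Matt Kuchar, two for some others" layout the question asks for.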
I have GDP values listed by country (rows) and years (column headings) in one dataset. I'm trying to combine it with another dataset where the values represent GINI. How do I merge these two massive datasets by country and year, when "year" is not a variable? (How do I manipulate each dataset so that I introduce "year" as a column and have repeating countries to represent each year?)
i.e. from the top dataframe to the bottom dataframe in the image?
Reshape the top dataset from wide to long and then merge it with your other dataset. There are many, many examples of reshaping data on this site with different approaches. A common one is to use the tidyr package, which has a function called gather that does just what you need.
long_table <- tidyr::gather(wide_table, key = year, value = GDP, `1960`:`1962`)
or whatever the last year in your dataset is (the backticks are needed because the column names start with digits; bare 1960:1962 would be read as column positions). You can install the tidyr package with install.packages('tidyr') if you don't have it yet.
Next time, please avoid posting pictures of your data; provide reproducible data instead so it is easier for others to answer exactly. You can use dput(..) to do so.
Hope this helps!
# sample data (added 'X' before the numeric year columns, as R doesn't allow a column name to start with a digit)
df <- data.frame(Country_Name = c('Belgium', 'Benin'),
                 X1960 = c(123, 234),
                 X1961 = c(567, 890))
library(dplyr)
library(tidyr)
df_new <- df %>%
  gather(Year, GDP, -Country_Name)
df_new$Year <- gsub('X', '', df_new$Year)
df_new
Output is:
Country_Name Year GDP
1 Belgium 1960 123
2 Benin 1960 234
3 Belgium 1961 567
4 Benin 1961 890
(PS: As already suggested by others you should always share sample data using dput(df))
With the data in Excel, if you have Excel 2010 or later, you can use Power Query or Get & Transform to unpivot the "year" columns.
This is the generated code, although you can do the whole thing through the GUI.
And this is the result, although I had to format the GDP column to reproduce your mix of Scientific and Number formatting, and I had a typo on Belgium 1962.
I was wondering if there is a huge performance difference/impact for large datasets when you try to subset the data.
In my scenario, I have a dataframe with just under 29,000 records.
When I had to subset the data, I thought of 2 ways to do this.
The data is read from a csv file using reactive.
option 1
long_lat_df <- reactive({
  long_lat <- subset(readFile(), select = c(Latitude..deg., Longitude..deg.))
  return(long_lat)
})
option 2
what I had in mind was to extract the 2 columns and assign them to their own variables, long and lat. From there I can combine the 2 columns to form a new data frame that I can use for spatial analysis.
Would there be a potential performance impact between the 2 options?
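For reference, a non-reactive sketch of option 2 (extract each column, then rebuild a data frame), with a toy data frame standing in for readFile() since the original CSV isn't shown:

```r
# toy stand-in for the data returned by readFile() (hypothetical values)
readFile <- function() {
  data.frame(
    Latitude..deg.  = c(1.35, 1.29, 1.44),
    Longitude..deg. = c(103.82, 103.85, 103.79),
    Name = c("A", "B", "C")
  )
}

# option 2: pull each column into its own vector, then recombine
df   <- readFile()
long <- df$Longitude..deg.
lat  <- df$Latitude..deg.
long_lat <- data.frame(Latitude..deg. = lat, Longitude..deg. = long)
```

Both options end up copying the same two columns, so at roughly 29,000 rows the performance difference is likely negligible; column extraction in R is cheap at that scale either way.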
I'm trying to merge two data frames by matching their tickers. However, in my first data frame, some companies are listed with two tickers (as below):
Company Ticker Returns
1800-Flowers FLWS, FWC .01
First Busey Corp BUSE .02
First Bancshare Inc FBSI 0
In the second data frame, there is only one ticker.
Ticker Other Info
FLWS 50
BUSE 60
FBSI 20
How can I merge these two files together in R so that it recognizes that the data for FLWS belongs to the first row of data frame 1, because that row contains the ticker?
Please note that in data frame 1, most companies have only 1 ticker listed, many have 2, and some companies have 3 or 4 tickers.
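One possible approach (a sketch, not a definitive answer): split the comma-separated ticker lists so each ticker gets its own row with tidyr::separate_rows(), then join on the now-unique Ticker column. The data below is the sample from the question:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Company = c("1800-Flowers", "First Busey Corp", "First Bancshare Inc"),
  Ticker  = c("FLWS, FWC", "BUSE", "FBSI"),
  Returns = c(.01, .02, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Ticker    = c("FLWS", "BUSE", "FBSI"),
  OtherInfo = c(50, 60, 20),
  stringsAsFactors = FALSE
)

# one row per ticker (splitting on the comma), then a left join on Ticker
merged <- df1 %>%
  separate_rows(Ticker, sep = ",\\s*") %>%
  left_join(df2, by = "Ticker")
```

Tickers that have no match in the second data frame (FWC here) come back with NA in the joined columns; companies with 3 or 4 tickers are handled the same way, one row per ticker.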