How to view all rows of an output thats not in table form - r

Problem
I started with an ungrouped data set which I proceeded to group, the output of the grouping however, does not return all 427 rows. The output is needed to input that data into a table.
So initially the data was ungrouped and appears as follows:
Occupation Education Age Died
1 household Secondary 39 no
2 farming primary 83 yes
3 farming primary 60 yes
4 farming primary 73 yes
5 farming Secondary 51 no
6 farming iliterate 62 yes
The data is then grouped as follows:
occu %>% group_by(Occupation, Died, Age) %>% count()##use this to group on the occupation of the suicide victimrs
which gives the following output:
Occupation Died Age n
<fct> <fct> <int> <int>
1 business/service no 20 2
2 business/service no 30 1
3 business/service no 31 2
4 business/service no 34 1
5 business/service no 36 2
6 business/service no 41 1
7 business/service no 44 1
8 business/service no 46 1
9 business/service no 84 1
10 business/service yes 21 1
# ... with 417 more rows
problem is i need all the rows in order to input the grouped data into a table using:
dt <- read.table(text="full output from above")
If any more code would be useful to solving this let me know.

It is not really clear what you want but try the following code :
occu %>% group_by(Occupation, Died, Age) %>% count()
dt <- as.data.frame(occu)
It seems you simply want to convert the tibble to a data frame. There is no need to print all the output and then copy-paste it into read.table().
Also if you need you can save your output with write.table(dt,"filename.txt"), it will create a .txt file with your data.
If what you want is really print all the tibble output in the console, then you can do the following code, as suggested by Akrun's link :
options(dplyr.print_max = 1e9)
It will allow R to print the full tibble into the console, which I think is not efficient to do what you are asking.

Related

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
I have the following unique values in the respective columns: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, I must have at least one row each for each StoreID and each WEEK value.
I used the function sample_frac in dplyr in various trials but that does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group the desired columns before sampling rows. The last line will return one random row for each unique storeID-week pairing.
df <- data.frame(storeid=sample(c(2000:2010),1000,T),
week=sample(c(1:52),1000,T),
value=runif(1000))
# count number of duplicated storeid-week pairs
df %>% count(storeid,week) %>% filter(n>1)
df %>% group_by(storeid,week) %>% sample_n(1)
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
Not sure if I have read the problem correctly. I would have tried the following using sample function.
Assuming your dataframe is called MyDataFrame and is two dimensional, I would have done it like this.
RandomizedDF <- MyDataFrame[sample(dim(MyDataFrame)[1],dim(MyDataFrame)[1],replace=FALSE),]
Let me know if this is what you wanted or something else?

Is there a way to filter that does not include duplicates/repeated entries by particular groups?

Some context first:
I'm working with a data set which includes health related data. It includes questionnaire scores pre and post treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution on dplyr as this is package I'm most familiar with, but I didn't achieve what I've wanted.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1)
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
df$ClientNumber<-as.character(df$ClientNumber)
df$Pre_Post<-as.factor(df$Pre_Post)
View(df)
#tried solution
df2<-df%>%
group_by(ClientNumber)%>%
filter( Pre_Post==1|Pre_Post==2)
#this doesn't work, or needs more code to it
As you can see, the first four client numbers both have a pre and post treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post treatment score.
I only want to analyse clients that have a pre and post score, therefore I need to filter clients which have completed treatment, while excluding ones that do not have a post treatment score if they have appeared in the data again. In relation to the example I've provided, I want to include the first 8 for analysis while excluding the last two, as they do not have a post treatment score.
If these cases are to be kept in order, you could try:
library(dplyr)
df %>%
group_by(ClientNumber) %>%
filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 4355 1 62
2 2231 1 76
3 8894 1 88
4 9002 1 56
5 4355 2 22
6 2231 2 30
7 8894 2 35
8 9002 2 40
I don't know if you actually need to use n_distinct() but it won't hurt to keep it. This will remove cases who have a pre score but no post score if they exist in the data.
First arrange ClientNumbers then group_by and finally filter using dplyr::lead and dplyr::lag
library(dplyr)
df %>% arrange(ClientNumber) %>% group_by(ClientNumber) %>%
filter(Pre_Post==1 & lead(Pre_Post)==2 | Pre_Post==2 & lag(Pre_Post)==1)
# A tibble: 8 x 3
# Groups: ClientNumber [4]
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 2231 1 76
2 2231 2 30
3 4355 1 62
4 4355 2 22
5 8894 1 88
6 8894 2 35
7 9002 1 56
8 9002 2 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
library(dplyr)
df %>%
arrange(ClientNumber) %>%
group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
filter(n() == 2) %>%
ungroup() %>%
select(-group)
# ClientNumber Pre_Post QuestionnaireScore
# <chr> <fct> <dbl>
#1 2231 1 76
#2 2231 2 30
#3 4355 1 62
#4 4355 2 22
#5 8894 1 88
#6 8894 2 35
#7 9002 1 56
#8 9002 2 40
The same can be translated in base R using ave
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(Pre_Post,ClientNumber,cumsum(Pre_Post == 1),FUN = length) == 2)

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to Pivot tables in Excel, and trying to replicate a pivot I have done in Excel in R. I have spent a long time searching the internet/ YouTube, but I just can't get it to work.
I am looking to produce a table in which I the left hand side column shows a number of locations, and across the top of the table it shows different pages that have been viewed. I want to show in the table the number of views per location which each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard
Usage.xlsx", sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits but can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us. So I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard. You should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
data.frame(
month = sample(1:12, size = n, replace = TRUE),
team = letters[1:5],
value = sample(1:100, size = n, replace = TRUE)
)
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
filter(month == 10) %>%
group_by(team) %>%
summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
specificreportsLocal %>% filter(Month == "October")
rpivotTable(specificreportsLocal, rows="Employee Team", cols="page", vals="views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>% filter(Month == "October") %>%
group_by_at(c("Employee Team", "page")) %>%
summarise(nr_views = sum(views, na.rm=TRUE))

Joins in R while also spreading out information from one data frame

I am attempting to join together two data frames. One contains records of when certain events happened. The other contains daily information on values that occurred for a given organization.
My current challenge is how to join together the information in the "when certain events happened" data frame fully into the records data frame. Most of dplyr's joins appear to simply join one line together. I need to fully spread out the record information based on start and end dates.
In other words, I need to spread out information from one line into many lines, while simultaneously joining to the daily data table. It is important that I do this in R because the alternative is quite a bit of filtering and dragging in Excel (the information covers thousands of rows).
Below is a representation of the daily data table
value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4
Below is a representation of the records table
year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes
And finally, here is what I am aiming for in the end:
value month day org link event event_info
12 1 1 AA AA-1-1-1
45 1 2 AA AA-1-1-2 Buy Yes
31 1 3 AA AA-1-1-3 Buy Yes
10 1 4 AA AA-1-1-4
Is there any way to accomplish this in R? I have tried using dplyr joins but usually am only able to join together the initial link.
Edit: The second "end" link refers to an end date. In the records table this is all in one line, while the second data frame has daily information.
Edit: Below I have put together a cleaner look at my real data. The first image is of DAILY DATA while the second is of RECORDS OF EVENTS. The third is what I would like to see (ideally).
Daily data, which will have multiple orgs present
Records data, note org id AA and the audience
Ideal combined data
We have first to build some dates in order to build date sequences that we'll unnest to get a long version of df2, which we right join on df1:
library(tidyverse)
df2 %>%
separate(link,c("org1","year1","month1","day1")) %>%
separate(end_link,c("org2","year2","month2","day2")) %>%
rowwise %>%
transmute(org,event,event_info, date = list(
as.Date(paste0(year1,"-",month1,"-",day1)):as.Date(paste0(year2,"-",month2,"-",day2)))) %>%
unnest %>%
right_join(df1 %>% mutate(date=as.numeric(as.Date(paste0(year,"-",month,"-",day))))) %>%
select(value, month, day, org, link, event,event_info)
# # A tibble: 4 x 7
# value month day org link event event_info
# <int> <int> <int> <chr> <chr> <chr> <chr>
# 1 12 1 1 AA AA-1-1 <NA> <NA>
# 2 45 1 2 AA AA-1-2 Buy Yes
# 3 31 1 3 AA AA-1-3 Buy Yes
# 4 10 1 4 AA AA-1-4 <NA> <NA>
data
df1 <- read.table(text="value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4",h=T,strin=F)
df2 <- read.table(text="year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes",h=T,strin=F)
I would use the Data table package, it is for me the best R package to do data analysis. Hope to have properly understood the problem, let me know if it does not work.
The first part creates the data-set (I created the two data.table objects in two different ways just to show both alternatives, you could read your data directly from excel, .txt, .csv or similar, let me know if you want to know how to do this).
library(data.table)
value<-c(12,45,31,10)
year<-c(1,1,1,1)
month<-c(1,1,1,1)
day<-c(1,2,3,4)
org<-c("AA","AA","AA","AA")
link<-c("AA-1-1","AA-1-2","AA-1-3","AA-1-4")
Daily_dt<-data.table(value, year,month,day,org,link)
Records_dt<-data.table(year=c(1,1),month=c(1,1),day=c(2,3),org=c("AA","BB"),link=c("AA-1-1-2","BB-1-2-7"),end_link=c("AA-1-1-3","BB-1-2-10"),
event=c("Buy","Buy"),event_info=c("Yes","Yes"))
Daily_dt[,Date:=as.Date(paste(year,"-",month,"-",day,sep=""))]
To achieve what you want you need these lines
Records_dt=rbind(Records_dt[,c("org","link","event","event_info")],
Records_dt[,list(org,link=end_link,event,event_info)])
Record_Dates<-as.data.table(tstrsplit(Records_dt$link,"-")[-1])
Record_Dates[,Dates:=as.Date(paste(V1,"-",V2,"-",V3,sep=""))]
Records_dt[,Date:=Record_Dates$Dates]
setkey(Records_dt,Date)
setkey(Daily_dt,Date)
Records_dt<-Records_dt[,c("Date","event","event_info")][Daily_dt,]
Records_dt<-Records_dt[,c("value","month","day","org","link","event","event_info")]
and this is the result
> Records_dt
value month day org link event event_info
1: 12 1 1 AA AA-1-1 NA NA
2: 45 1 2 AA AA-1-2 Buy Yes
3: 31 1 3 AA AA-1-3 Buy Yes
4: 10 1 4 AA AA-1-4 NA NA
If your input data had more than one event in the same day (with or without the same org) something like:
> Records_dt
year month day org link end_link event event_info
1: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
2: 1 1 3 BB BB-1-2-7 BB-1-2-10 Buy Yes
3: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
4: 1 1 3 AA AA-1-2-7 AA-1-2-10 Buy Yes
some tweaks may be required, but am not sure if you required this, so did not add it.

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60
For some reason I need to put this in a cross-sectional structure such I get averages over all years for each country, that is in the end, it should look like,
Country Outcome Country-Characteristic
A 12 40
B 11 60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean) but that did not work as I wanted it.
Two tips: 1- When you ask a question, you should provide a reproducible example for the data too (as I did with read.table below). 2- It's not a good idea to use "-" in column names. You should use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(Country) %>%
summarize(Outcome=mean(Outcome),Countrycharacteristic=mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate
aggregate(.~Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60

Resources