I have a data frame that contains the ability estimates of 5000 examinees together with their test scores; both are continuous variables. Since there are too many examinees, it would be messy to plot all their scores, so I wish to draw a 'broken line plot' or 'conditional mean plot' that averages the test scores of several examinees with similar ability levels at a time and plots their average score against their average ability, like the plot below.
I already managed to do this with the codes below.
df<-cbind(rnorm(100,set.seed(123)),sample(100,set.seed(123)),) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability variable
seq<-round(seq(from=1, to=nrow(df), length.out=10),0) #divide the data equally to nine groups (which is also gonna be the 9 points that appear in my plot)
b<-data.frame()
for (i in 1:9) {
b[i,1]<-mean(df[seq[i]:seq[i+1],1]) #calculate the mean of the ability by group
b[i,2]<-mean(df[seq[i]:seq[i+1],2]) # calculate the mean of test score by group
}
I got the mean of the ability and test score using this for loop, and it looks like this
and finally, do the plot
plot(b$V1,b$V2, type='b',
xlab="ability",
ylab="score",
main="Conditional score")
This code meets my goal, but I can't help wondering whether there's a simpler way to do it. Drawing a broken line plot by averaging data that is sorted from low to high seems like a common task.
I wonder if there is any function or trick for this. All ideas are welcome! :)
Here is a solution to create the data to be plotted using dplyr:
library(dplyr)
set.seed(123)
df<-cbind(rnorm(100,1),sample(100,50)) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability variable
df$id <- seq(1, nrow(df))
df %>% mutate(bin = ntile(id, 10)) %>%
group_by(bin) %>%
dplyr::summarize(meanAbility = mean(ability, na.rm=T),
meanScore = mean(score, na.rm=T)) %>%
as.data.frame()
bin meanAbility meanScore
1 1 -0.81312770 41.6
2 2 -0.09354171 52.3
3 3 0.29089892 54.4
4 4 0.68490709 45.8
5 5 0.93078744 59.8
6 6 1.17380069 34.0
7 7 1.42942368 41.3
8 8 1.64965315 40.1
9 9 1.95290596 35.6
10 10 2.50277510 52.9
I would approach the whole thing a bit differently (note also that your code has several errors and won't run the way you showed it).
The example below will lead to different numbers than yours (due to the random generation of numbers and your non-working code).
library(tidyverse)
df <- data.frame(ability = rnorm(100),
score = sample(100)) %>%
arrange(ability) %>%
mutate(seq = ntile(n = 9)) %>%
group_by(seq) %>%
summarize(mean_ability = mean(ability),
mean_score = mean(score))
which gives:
# A tibble: 9 x 3
seq mean_ability mean_score
<int> <dbl> <dbl>
1 1 -1.390807 45.25
2 2 -0.7241746 56.18182
3 3 -0.4315872 49
4 4 -0.2223723 48.81818
5 5 0.06313174 56.36364
6 6 0.3391321 42
7 7 0.6118022 53.27273
8 8 1.021438 50.54545
9 9 1.681746 53.54545
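For readers who prefer to avoid extra packages, the same binning idea can be sketched in base R. This is only one possible approach, not the method used in the answers above: it assumes nine roughly equal-sized bins built from the ranks of ability, and the names n_bins and binned are illustrative.

```r
# Base R sketch: rank-based bins, then group means (no extra packages).
set.seed(123)
df <- data.frame(ability = rnorm(100), score = sample(100))  # simulated data

n_bins <- 9  # assumed number of points on the plot

# cut() on the ranks gives bins of (almost) equal size, ordered by ability
df$bin <- cut(rank(df$ability, ties.method = "first"),
              breaks = n_bins, labels = FALSE)

# mean ability and mean score within each bin
binned <- aggregate(cbind(ability, score) ~ bin, data = df, FUN = mean)

# same "broken line" plot as in the question
plot(binned$ability, binned$score, type = "b",
     xlab = "ability", ylab = "score", main = "Conditional score")
```

Because the bins come from ranks, every bin holds 11 or 12 examinees here, which is close to what ntile() produces.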
Looking for advice on refining my code and also trimming to a date range.
The spreadsheet itself is pulled from another system and so the structure of the excel cannot be changed. When you pull the data it basically starts at E2, with the first date column in F2, and the first item in E3. The data will continue to populate to the right for as long as it goes on for. I have replicated the structure below.
And I want it to look like:
I have come up with the below, which works, but I was looking for advice on refining it down to fewer individual step by steps.
In the below code:
#1 = extracting data
#2 = pulling the dates out
#3 = formatting from Excel number to an actual date
#4 = grabbing the item names
#5 = transposing data and skipping some parts
#6 = adding in dates to the row names
library(readxl)
#1
gtb <- data.frame(read_excel("C:/example.xlsx",
sheet = "Sheet1"))
#2
dfdate <- gtb[1, -c(1,2,3,4,5)]
#3
dfdate <- format(as.Date(as.numeric(dfdate),
origin = "1899-12-30"), "%d/%m/%Y")
#4
rownames(gtb) <- gtb[,1]
#5
gtb <- as.data.frame(t(gtb[, -c(1,2,3,4,5)]))
#6
rownames(gtb) <- dfdate
After the row names have been added the structure is such that I am happy to start creating the visuals where needed.
thanks for your advice
David
Here is one suggestion, I don't really have easy access to your data, but I am including code to remove those columns as you do, based on their names, which can be nicer than removing by index.
library(dplyr)
df <- read.table( text=
"Item_Code 01/01/2018 01/02/2018 01/03/2018 01/04/2018
Item 99 51 60 69
Item2 42 47 88 2
Item3 36 81 42 48
",header=TRUE, check.names=FALSE) %>%
rename( `Item Code` = Item_Code )
library(tibble)
library(lubridate)
x <- df %>% select( -matches("Code \\d|Internal Code") ) %>%
column_to_rownames("Item Code") %>%
t %>% as.data.frame %>%
rownames_to_column("Item Code") %>%
mutate( `Item Code` = dmy(`Item Code`) )
x
Output:
> x
Item Code Item Item2 Item3
1 2018-01-01 99 42 36
2 2018-02-01 51 47 81
3 2018-03-01 60 88 42
4 2018-04-01 69 2 48
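The question also asks about trimming to a date range. Once the dates are real Date values, one simple way is plain logical subsetting. The sketch below rebuilds a small data frame like the output above; the column name date and the bounds from and to are assumptions chosen for illustration.

```r
# Rebuild a small transposed table with a proper Date column (illustrative data)
x <- data.frame(
  date  = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01")),
  Item  = c(99, 51, 60, 69),
  Item2 = c(42, 47, 88, 2),
  Item3 = c(36, 81, 42, 48)
)

# Assumed bounds of the range to keep (both inclusive)
from <- as.Date("2018-02-01")
to   <- as.Date("2018-03-31")

# Keep only the rows whose date falls inside the range
trimmed <- x[x$date >= from & x$date <= to, ]
trimmed
```

This only works reliably when the column is of class Date; comparing dd/mm/yyyy character strings would sort them alphabetically instead.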
I went back and forth a bit with this solution, but it can be nice to also showcase how to remove columns by a regex on their column names, since you are removing several similarly named columns.
The t trick, which you also use, works because there is really only one more column there that would cause problems with this, as others have commented, and this can be temporarily stowed away as rownames. If that weren't the case, you would be looking at a more complex solution involving pivot_wider and pivot_longer, or splitting the data.frame and transposing only one of the halves.
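As a rough sketch of the pivot_longer / pivot_wider route mentioned above, assuming a small table shaped like the example (the column names date and value are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Small stand-in for the spreadsheet: one row per item, one column per date
df <- data.frame(
  `Item Code`  = c("Item", "Item2", "Item3"),
  `01/01/2018` = c(99, 42, 36),
  `01/02/2018` = c(51, 47, 81),
  check.names = FALSE
)

transposed <- df %>%
  # stack all date columns into (date, value) pairs
  pivot_longer(cols = -`Item Code`, names_to = "date", values_to = "value") %>%
  # spread the items back out as columns, one row per date
  pivot_wider(names_from = `Item Code`, values_from = value) %>%
  # turn the dd/mm/yyyy header text into real dates
  mutate(date = as.Date(date, format = "%d/%m/%Y"))

transposed
```

Unlike t(), this keeps the value columns numeric even when other text columns are present, because nothing is coerced through a matrix.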
I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bear to input them all by hand, but I can't find any spreadsheet that has the data formatted how I want it. I want it in a pretty standard table format:
[Image: my desired table format]
However, the table I have from the Secretary of State looks like the following:
[Image: gross, ugly table]
I wrote some pseudocode; basically, we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number column;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistent, so we need to account for uncontested races, libertarians, greens, and write-ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no rep, or no Dem, or no other party contested
read through single row 4 spaces, if its null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in Rstudio, how? How can I read specific cells in a table, hopefully indexing? Lastly, how do I write to a table in R. Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use square brackets after the data frame's name, as df[row, column]. Example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
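If "writing to a table" means changing values inside a data frame, the same bracket indexing works on the left-hand side of an assignment. A minimal sketch follows; the temporary file path is only a placeholder for wherever you want the CSV to go.

```r
df <- data.frame(x = c(1, 2, 3), y = c("A", "B", "C"), z = c(15, 25, 35))

df[2, "z"] <- 99          # overwrite a single cell (row 2, column z)
df$total <- df$x + df$z   # add a new column computed from existing ones

# Write the result to a CSV and read it back to check it
out_file <- tempfile(fileext = ".csv")  # placeholder output path
write.csv(df, out_file, row.names = FALSE)
check <- read.csv(out_file)
check
```

Setting row.names = FALSE avoids an extra unnamed first column in the file.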
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from wikipedia and extract table.
The rvest package helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file. In this case
I decided to look for the table that has the class name “wikitable plainrowheaders sortable”,
as I learned from inspecting the code that the only table with that class is
the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
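An alternative to looking up the table's position is to select it directly with a CSS class selector. The sketch below runs against a tiny stand-in page built with minimal_html() rather than the real Wikipedia HTML, so the table contents are invented; it assumes rvest 1.0 or later, where html_element() is available.

```r
library(rvest)

# Tiny stand-in page (hypothetical content, not the real Wikipedia HTML)
page <- minimal_html('
  <table class="other"><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table class="wikitable plainrowheaders sortable">
    <tr><th>District</th><th>Votes</th></tr>
    <tr><td>District 1</td><td>100</td></tr>
    <tr><td>District 2</td><td>200</td></tr>
  </table>')

# "table.a.b.c" matches a <table> that carries all three classes
results <- page %>%
  html_element("table.wikitable.plainrowheaders.sortable") %>%
  html_table()

results
```

On the real page the same selector could be passed to html_element(wiki_html, ...), which avoids hard-coding the index 20 if tables are added or removed.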
2. Clean up table.
The table has duplicated column names; we can fix that by using as_tibble() and its .name_repair argument. We then use dplyr::select() to get the columns. Furthermore, we use dplyr::filter() to delete the first two rows, which have "District" as a value in the District column. At this point the columns are still character
vectors, but we need them to be numeric, so we first delete the commas from
all columns and then convert columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
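In dplyr 1.0.0 and later, mutate_all() and mutate_at() are superseded by across(). A small sketch of the same comma-stripping step with across() follows, using made-up column names Rep and Dem on a two-row stand-in table:

```r
library(dplyr)

# Stand-in for the raw table: vote counts stored as text with commas
raw <- tibble(
  District = c("District 1", "District 2"),
  Rep      = c("168,165", "139,188"),
  Dem      = c("61,263", "119,992")
)

clean <- raw %>%
  # strip the thousands separators and convert in one step
  mutate(across(c(Rep, Dem), ~ as.numeric(gsub(",", "", .x))))

clean
```

Limiting across() to the vote columns leaves District untouched as character.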
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally, we calculate the margin between Democratic and Republican votes by first adding up their votes
as total_votes and then dividing the difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third and fourth columns. So what you want to do is the following.
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X4)) # assuming the vote counts are in the fourth column (X4); X3 holds the party
Pivot/ Reshape data to wide format
tx18 %>%
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
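The tallying logic from the question's pseudocode (Republican, Democratic and everything else per district, with zeros where a party did not run) can also be sketched in base R. The data below is invented and assumes the file has already been brought into a long format with one row per district and party; the column names district, party and votes are hypothetical.

```r
# Hypothetical long-format rows: one row per district/party/votes
races <- data.frame(
  district = c(1, 1, 1, 2, 3, 3, 3),
  party    = c("REP", "DEM", "LIB", "REP", "DEM", "GRN", "W-I"),
  votes    = c(500, 400, 30, 900, 700, 20, 5)
)

# Collapse every party other than REP and DEM into OTH
races$party3 <- factor(
  ifelse(races$party %in% c("REP", "DEM"), races$party, "OTH"),
  levels = c("REP", "DEM", "OTH")
)

# Sum votes per district and party group; missing combinations become 0
totals <- as.data.frame.matrix(xtabs(votes ~ district + party3, data = races))
totals$district <- as.integer(rownames(totals))
totals
```

Fixing the factor levels guarantees that all three columns exist even if, say, no third party ran anywhere.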
I'm new to R programming, so this question might be simple.
Anyway, I've tried to find an answer to this specific thing I'm trying to do and didn't find one.
So, I'm trying to import new data into my old data.frame.
The problem is that this data has to replace previous NA values in variables that already exist.
Also, my data has different individuals (companies) in different periods (years), and my new data set only has the companies and years that were missing, plus some observations that I already had.
I tried to simulate the problem with the data frames below:
Data frame with NAs:
df1 <- data.frame( company = c(rep("A",3), rep("B",3), rep("C",3)),
year = c(rep(2016:2018,each=1)),
income = c(95,87,93,NA,NA,58,102,80,NA),
debt = c(43,50,51,NA,37,37,53,NA,NA),
stringsAsFactors= F )
To search for new data, I created a data set with only the missing data, as my data had too many observations:
df_NA <- df1[is.na(df1$income) | is.na(df1$debt), ] # rows where income or debt is missing
So after searching, I was able to find the missing data, and now I have something like this:
df2 <- data.frame( company = c("A", "B" , "C" , "C"),
year = c(2018, 2016, 2017, 2018),
income = c(60,55, 80, 82),
debt = c(32,37, 53,48),
stringsAsFactors= F )
Now, I'm trying to put this data together, so I have the complete data.frame to work with.
The problem is that I couldn't find a way to do it yet. I've tried merge and join, matching on company and year, but the variables that have the same name in both data.frames get duplicated and given a suffix.
In my data I have many more observations and variables to fill, so I want to find a way to do it with a single command. Also, this is going to happen again in the future, so it would be very helpful.
I'm sorry if this was already answered. Thank you!
Here is an option using data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=c("company", "year"), c('income', 'debt') := { list(i.income, i.debt)}]
# company year income debt
#1: A 2016 95 43
#2: A 2017 87 50
#3: A 2018 60 32
#4: B 2016 55 37
#5: B 2017 NA 37
#6: B 2018 58 37
#7: C 2016 102 53
#8: C 2017 80 53
#9: C 2018 82 48
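Note that the update join above overwrites every matched row, including values that were not missing (A 2018 changes from 93/51 to 60/32). If only the NA values should be replaced, one possible variation is to wrap the assignment in fcoalesce(), which keeps the existing value when it is present. This sketch assumes data.table 1.12.4 or later and uses the names dt1 and dt2 to avoid clobbering the originals.

```r
library(data.table)

dt1 <- data.table(
  company = rep(c("A", "B", "C"), each = 3),
  year    = rep(2016:2018, times = 3),
  income  = c(95, 87, 93, NA, NA, 58, 102, 80, NA),
  debt    = c(43, 50, 51, NA, 37, 37, 53, NA, NA)
)
dt2 <- data.table(
  company = c("A", "B", "C", "C"),
  year    = c(2018L, 2016L, 2017L, 2018L),
  income  = c(60, 55, 80, 82),
  debt    = c(32, 37, 53, 48)
)

# Update join by reference: keep dt1's value when present, else take dt2's
dt1[dt2, on = c("company", "year"),
    `:=`(income = fcoalesce(income, i.income),
         debt   = fcoalesce(debt, i.debt))]
dt1
```

Rows with no match in dt2 (for example B 2017) are left as they were, so their NA values remain.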
Or another option using dplyr
library(dplyr)
full_join(df1, df2, by = c("year", "company")) %>%
mutate(
income = coalesce(income.x, income.y),
debt= coalesce(debt.x, debt.y),
) %>%
select(company, year, income, debt)
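With many variables to fill, writing one coalesce() line per column gets tedious. Assuming dplyr 1.0.0 or later is available, rows_patch() may be a shorter route: it fills only the NA values of the first data frame from matching rows of the second. It expects every key in the second data frame to exist in the first, which holds for the example data reproduced here.

```r
library(dplyr)

df1 <- data.frame(
  company = rep(c("A", "B", "C"), each = 3),
  year    = rep(2016:2018, times = 3),
  income  = c(95, 87, 93, NA, NA, 58, 102, 80, NA),
  debt    = c(43, 50, 51, NA, 37, 37, 53, NA, NA),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  company = c("A", "B", "C", "C"),
  year    = c(2018L, 2016L, 2017L, 2018L),
  income  = c(60, 55, 80, 82),
  debt    = c(32, 37, 53, 48),
  stringsAsFactors = FALSE
)

# Fill NA values in df1 from df2, matching on company and year;
# values that are already present in df1 are left untouched
patched <- rows_patch(df1, df2, by = c("company", "year"))
patched
```

The columns keep their original names, so no .x / .y suffixes need to be cleaned up afterwards.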
I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For example, column 1: "age":"52" and column 2: "height":"170".
But the data is so messy that sometimes the age column contains a height question and answer, and for some questionnaires the questions and answers are duplicated.
I need the questions as variables and the answers as observations, but I have no clue how to get there. I could clean the data in Excel first, but since the columns are not constant (for example, there are some height questions in the age column) I see no way to do that, especially as I will get new data regularly, formatted the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
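If there are many columns, converting them one by one can be avoided with base R's type.convert(), which guesses a type for each column. A small sketch on a stand-in data frame (the values are copied from the result above; the name typed is illustrative):

```r
# Stand-in for a few columns of data_clean, all stored as character
data_clean <- data.frame(
  age_years    = c("50", "34"),
  born_in      = c("Italy", "Germany"),
  total_wealth = c("200000", "10000"),
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps text columns as character instead of making factors
typed <- type.convert(data_clean, as.is = TRUE)
str(typed)
```

The guesses are worth checking afterwards, since a single stray non-numeric entry keeps a whole column as character.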
I am sure you may run into other things, but this should be a start.
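One thing to watch for with the splitting approach is an answer that itself contains a comma or colon. Since each entry looks like a JSON object that is only missing its closing brace, another option might be to repair the string and let a JSON parser do the work. The sketch below assumes the jsonlite package is installed and uses two shortened, made-up records; parse_record is a hypothetical helper name.

```r
library(jsonlite)
library(dplyr)

# Two shortened records in the same shape as the data (closing brace missing)
raw <- c(
  '{"age_years":"50","job_des":"unemployed","Gender":"female"',
  '{"age_years":"24","Gender":"male","born_in":"England"'
)

# Hypothetical helper: add the missing brace, parse, return a one-row data frame
parse_record <- function(txt) {
  if (!endsWith(txt, "}")) txt <- paste0(txt, "}")
  as.data.frame(fromJSON(txt), stringsAsFactors = FALSE)
}

# bind_rows() lines up columns by name and fills absent questions with NA
parsed <- bind_rows(lapply(raw, parse_record), .id = "record")
parsed
```

A record that repeats the same question would still need separate handling, since this sketch does not deal with duplicated keys.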