I am new to R and really trying to wrap my head around everything (even taking an online course, which so far has not helped at all).
What I started with is a large data frame containing 97 variables pertaining to compliance with regulations.
I have created multiple dataframes based on the various geographic locations (there is probably an easier way to do it).
In each of these data frames, there are 7 variables for which I would like to find the mean of the "Yes" and "No" responses.
I first tried:
summary(urban$vio_bag)
   Length     Class      Mode 
      398 character character 
However, this tells me nothing useful except that I have 398 responses.
So I put this into a table:
urbanbag <- table(urban$vio_bag)
This at least provided me with the number of Yes and No responses:
  Var1 Freq
1   No  365
2  Yes   30
So I then converted to a data.frame:
urbanbag = as.data.frame(urbanbag)
Then viewed it:
summary(urbanbag)
  Var1      Freq      
 No :1   Min.   : 30.0
 Yes:1   1st Qu.:113.8
         Median :197.5
         Mean   :197.5
         3rd Qu.:281.2
         Max.   :365.0
And the output still did not help; if anything, it was even less useful.
I am not building these matrices in R; the table is imported from Excel.
I am just so lost and frustrated, having spent days trying to figure out something that seems so elementary, and googling for help has not worked out.
Is there a way to actually do this?
We can use prop.table to get the proportions:
v1 <- prop.table(table(urban$vio_bag))
then use barplot to plot them:
barplot(v1)
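With the counts from the question (365 No, 30 Yes), v1 would print something like:

       No       Yes 
0.9240506 0.0759494 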
Try dplyr's n() (which performs counts) within summarise():
library(dplyr)
data %>% group_by(yes_no_column) %>% summarise(my_counts = n())
This will give you the counts you're looking for. Adjust the group_by() variables as needed; multiple variables can be used at a time for grouping purposes. Just like n(), functions such as mean and sd can be used inside summarise(). If you want to add each calculated metric as a column instead, use mutate().
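For example, a minimal sketch (my_value is a hypothetical numeric column; the question's data is all Yes/No, so substitute a real numeric column):

library(dplyr)
data %>%
  group_by(yes_no_column) %>%
  summarise(my_counts = n(),
            mean_value = mean(my_value, na.rm = TRUE),  # my_value is a hypothetical column
            sd_value = sd(my_value, na.rm = TRUE))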
Oscar.
prop.table is a useful way of doing this. You can also solve it with mean(): the comparison returns a logical vector, and mean() treats TRUE as 1 and FALSE as 0, so the result is the proportion of matching responses:
mean(urban$vio_bag == "Yes")
mean(urban$vio_bag == "No")
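If the vector contains missing values (the question's summary shows 398 responses but only 395 tabulated, so a few may be NA), add na.rm = TRUE so they don't turn the result into NA:

mean(urban$vio_bag == "Yes", na.rm = TRUE)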
I am analysing a longitudinal survey (https://microdata.worldbank.org/index.php/catalog/3712) with around 2k participating households (the number drops with each round). There were 11 waves/rounds, each divided into around 6-8 datasets based on the theme of the questions. To analyse it, I need it in proper panel-data format, with each theme in one file combining all the waves.
Please see the Excel snippets (with most columns removed for simplicity) of how it looks: round 1 vs round 9. (The levels of the categorical variables have changed names, from full names to just numbers, but it's the same question.) Basically, the format looks something like this:
head(round1.csv)

 ID  INCOME SOURCE  ANSWER  CHANGE
101  1. Business    1. YES  3. Reduced
101  2. Pension     2. NO
102  1. Business    1. YES  2. No change
102  2. Assistance  1. YES  3. Reduced
So far I have only been analysing separate waves on their own, but I do not know how to:
1. Combine so many data frames together.
2. Convert it to the format where each ID appears only once per wave. I used spread to do modelling on single files. I think I can imagine what the data frame would look like if the question was only whether they receive the income source (maybe like this?):
WAVE   ID  Business  Pension
   1  101  1. YES    1. NO
   1  102  1. YES    1. YES
   2  101  1. NO     1. YES
   2  102  1. YES    1. YES
But I do not understand how it is supposed to look with the change to that income also included.
3. Deal with weights: there are weights added to one of the files for each wave. Some are missing, and they change per wave, as fewer households agree to participate each round.
I am happy to filter and only use households that participated in every round, to make it easier.
I looked for an answer here: "Panel data, from wide to long with multiple variables" and "Transpose a wide dataset into long with multiple steps", but I think my problem is different.
I am a student, definitely not an expert, so I apologise for my skill-level.
There are too many questions at once, so I'll ignore the weights (that should be a separate question, once the merging is resolved).
How to merge? For sure you'll be doing something called a left join. The leftmost dataset should be the longest one (the first wave). The others will be joined by ID, and the IDs missing in the later waves will get NAs instead of values. I'll be using tidyverse code examples; see the left_join documentation.
You'll have to deal with a few things on the way:
- duplicate column names: you can use the suffix argument, e.g. suffix = c(".wave1", ".wave2")
- different coding of the data (seen in your samples, e.g. s7q1 "1. YES" vs "1"): use something like extract() to get the same representation
When you're done with the joining, you need to reshape your data. That would be something like pivot_longer(), followed by extract() to get the ".wave#" suffix into a separate column. Then you can pivot_wider() back into a wider format, keeping your wave column.
R-like pseudocode that illustrates how it could be done; it does not run as-is (I don't have your datasets):
library(tidyverse)
library(readxl)

# one file per wave
read_excel("wave1.xlsx") -> d_w1
read_excel("wave2.xlsx") -> d_w2

# normalise the coding, e.g. reduce "1. YES" to just the leading number
d_w1 %>%
  extract(s7q1, into = "s7q1", regex = "([0-9]+)") ->
  d_w1fix

d_w1fix %>%
  # join wave 2 onto wave 1 by household ID; duplicate names get wave suffixes
  left_join(d_w2, by = "ID", suffix = c(".wave1", ".wave2")) %>%
  # long format: one row per ID x question.wave pair
  pivot_longer(-ID, names_to = "question", values_to = "answer") %>%
  # split the ".wave#" suffix into its own column
  extract(question, into = c("question", "wave"), regex = "([[:alnum:]]+)\\.wave([0-9])") %>%
  # back to wide: one column per question, keeping the wave column
  pivot_wider(names_from = "question", values_from = "answer") ->
  d_final
I am trying to calculate the weighted average by group in R, but it is only returning the weighted average of the whole dataset, and I haven't been able to determine where the issue is occurring. Below is my code. Note that in the weighted.mean() call, if I do not prefix the columns with the data frame name, nothing is returned, so I am not sure if the way I am referencing the data is causing the issue.
unit_averages = selected_units %>%
group_by(`Length x Width`,Date) %>%
summarise(index_mean = weighted.mean(selected_units$"Wtd Avg Price",w=selected_units$"Unit Count"))
@akrun provided the answer, but posted it as a comment rather than an answer, so posting this to close out the inquiry:
"Remove the selected_units$ and use backquotes for the column names with spaces." – akrun, Jun 10 at 18:03
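Applying that comment, the corrected call references the grouped data's columns directly (a sketch, untested without the data):

unit_averages <- selected_units %>%
  group_by(`Length x Width`, Date) %>%
  # refer to the columns directly, not via selected_units$, so summarise works per group
  summarise(index_mean = weighted.mean(`Wtd Avg Price`, w = `Unit Count`))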
I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates (but those dates are not in order and fall in different weeks).
My data looks like this (of course it's not only the 2nd of March):
I want to average the data over those 4 different days, so I can compare, e.g., the "Nb Nodes" from day 1 to day 4.
The goal is a jitter plot containing the group, the investigated data point, and the date.
I'm a medical student, so I don't yet have any knowledge about this kind of thing, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
library(dplyr)

# group by experiment date and group
DateGroup <- group_by(Exclude0, Exp.Date, Group)

# summarise the mean within every group and date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the code below will work.
1. group_by() the dimension you want to summarize by.
2a. across() is a helper so that you don't need to type each column manually; it lets us use tidyselect language to quickly reference the columns that contain "Nb" (a pattern I noticed in your screenshot).
2b. In the second argument of across(), you pass the function(s) you want applied to each column selected by the first argument.
2c. The optional .names argument of across() gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
# df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))

# if you just want a single column, do this instead
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))
I need to find the mean of a variable and the number of times a particular combination occurs for that mean value in R.
In the example, I have grouped by the variables cli, cus, and ron, and need to summarise to find the mean of age and the frequency of cash for this combination:
df%>% group_by(.dots=c("cli","cus","ron")) %>% summarise_all(mean(age),length(cash))
This doesn't work; is there another way out?
Maybe it's just me; I seem to have overcomplicated this one. Plain summarise() gets me what I needed:
df%>% group_by(.dots=c("cli","cus","ron")) %>% summarise(mean(age),length(cash))
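As an aside, the .dots argument is deprecated in recent dplyr versions, and naming the outputs makes the result easier to read; an equivalent sketch:

df %>%
  group_by(cli, cus, ron) %>%
  summarise(mean_age = mean(age),
            n_cash = length(cash))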
So I have data imported into R using data = read.delim("clipboard").
These are the last sections of the data, so I decided to use data2 = na.omit(data, method = "linear"), which gave me this result...
But as you can see, I have lost data from rows 290 to 293 for the 3rd and 4th columns. Please help me remove those NA values without losing data from the other columns. The data represents time and speed, and what I'm trying to do is find the average speed every 100 s, using code pointed out to me in my previous questions, which is in this link...
Keep the NA values as they are, but use na.rm = TRUE in your subsequent manipulations; e.g., sum(df[, 1], na.rm = TRUE), where df is your data frame.
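For the "average speed every 100 s" goal specifically, a minimal sketch (time and speed are assumed column names, not confirmed by the question):

library(dplyr)
data %>%
  mutate(bin = floor(time / 100) * 100) %>%          # assign each row to a 100-second bin
  group_by(bin) %>%
  summarise(avg_speed = mean(speed, na.rm = TRUE))   # NAs skipped without dropping other columns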