Looping through csv and dumping to data frame in R

Using R, I am trying to take a csv file, loop through it, extract values, and dump them into a data frame. There are four columns in the csv: ID, UG_inst, Freq, and Year. Specifically, I want to loop through the UG_inst column by institution name for each year (2010-11, 2011-12, 2012-13, and 2013-14) and put the value at that cell into the respective "cell" in the R data frame. Right now, the csv just has a Year column, but the data frame I've created has a column for each year. The ultimate idea is to be able to create bar graphs representing the frequency per institution per year. Currently, the code below throws NO errors, but appears to do nothing to the R data frame "j".
A couple of caveats: 1) Doing a nested for loop was making my head spin, so I decided to just use 2010-11 for now and just loop through the institution name. Since there are only 4 years, I can rewrite this four times, each time with a different year. 2) Also, in the csv, there are repeat names. So, if an institution name appears twice (will be adjacent rows in the csv due to alphabetical arrangement), is there a way to dump the SUM of these into the data frame in R?
All relevant info is below. Thanks so much for any help!!!!
Here is a link to the .csv file: https://www.dropbox.com/s/9et7muchkrgtgz7/UG_inst_ALL.csv
And here is the R code I am trying:
abc <- read.csv(insert file path to above csv here)
inst_string <- unique(abc$UG_inst)
j <- data.frame("UG_inst"=inst_string,"2010-11"=NA,"2011-12"=NA,"2012-13"=NA,"2013-14"=NA)
for (i in inst_string) {
inst.index <- which(abc$UG_inst == i && abc$Year == "2010-11")
j$X2010.11[j$Ug_inst==i] <- abc$Freq[inst.index]
}

Instead of using a nested loop (or a loop at all) I suggest using the reshape() function in base R.
abc <- read.csv("UG_inst_ALL.csv")
abc <- abc[2:4]
reshape(data = abc,
v.names = "Freq",
timevar = "Year",
idvar = "UG_inst",
direction = "wide")

This is known as "reshaping" your data, and you are going from a "long" format to a "wide" format.
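As a side note on why the loop in the question leaves "j" untouched: `&&` does not compare element by element (depending on the R version it either uses only the first element or raises an error for longer inputs), and `j$Ug_inst` is a misspelling of `j$UG_inst`, so it is NULL and the assignment selects no rows. The following is only a sketch of a corrected loop on made-up toy data (the values are not from the real csv); it uses `&` and `sum()` so that repeated institution rows are added up.

```r
# Toy stand-in for the csv; the values are invented for illustration only
abc <- data.frame(
  UG_inst = c("Adrian College", "Adrian College", "Albion College", "Albion College"),
  Freq    = c(1, 2, 1, 0),
  Year    = c("2010-11", "2010-11", "2010-11", "2011-12"),
  stringsAsFactors = FALSE
)

inst_string <- unique(abc$UG_inst)
years <- sort(unique(abc$Year))

# One row per institution, one NA-filled column per year
j <- data.frame(UG_inst = inst_string, stringsAsFactors = FALSE)
j[years] <- NA_real_

for (i in inst_string) {
  for (y in years) {
    # `&` is element-wise, so `rows` marks every matching row of abc
    rows <- abc$UG_inst == i & abc$Year == y
    # sum() adds up repeated institution rows and gives 0 when there are none
    j[j$UG_inst == i, y] <- sum(abc$Freq[rows])
  }
}
j
```

The reshaping approaches in the answers remain the shorter route; the loop is shown only to make the two bugs visible.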
In addition to base R's reshape function, here are a few other options to consider.
I'll assume that we are starting with data read in like the following.
abc <- read.csv("~/Downloads/UG_inst_ALL.csv", row.names = 1)
head(abc)
# UG_inst Freq Year
# 1 Abilene Christian University 0 2010-11
# 2 Adams State University 0 2010-11
# 3 Adrian College 1 2010-11
# 4 Agnes Scott College 0 2010-11
# 5 Alabama A&M University 1 2010-11
# 6 Albion College 1 2010-11
Option 1: xtabs
out <- as.data.frame.matrix(xtabs(Freq ~ UG_inst + Year, abc))
head(out)
# 2010-11 2011-12 2012-13 2013-14
# Abilene Christian University 0 1 0 0
# Adams State University 0 0 0 1
# Adrian College 1 0 0 0
# Agnes Scott College 0 0 1 0
# Alabama A&M University 1 3 1 2
# Albion College 1 0 0 0
Option 2: dcast from "reshape2"
library(reshape2)
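# fun.aggregate = sum adds up repeated institution rows;
# without it, dcast falls back to counting rows when duplicates exist
head(dcast(abc, UG_inst ~ Year, value.var = "Freq", fun.aggregate = sum))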
Option 3: spread from "tidyr"
library(dplyr)
library(tidyr)
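# abc was read with row.names = 1 above, so there is no X column to drop.
# Summing first collapses repeated institution rows, which spread() cannot handle.
abc %>% group_by(UG_inst, Year) %>% summarise(Freq = sum(Freq)) %>% spread(Year, Freq, fill = 0)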
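The question also asks whether repeated institution names can be summed before they land in the wide data frame. A minimal base R sketch of that step, on invented toy data rather than the real csv, is to aggregate first and reshape afterwards:

```r
# Toy data with a repeated institution/year pair; values are invented
abc <- data.frame(
  UG_inst = c("Adrian College", "Adrian College", "Albion College", "Albion College"),
  Freq    = c(1, 2, 1, 4),
  Year    = c("2010-11", "2010-11", "2010-11", "2011-12"),
  stringsAsFactors = FALSE
)

# Step 1: one row per institution and year, with Freq summed over duplicates
abc_sum <- aggregate(Freq ~ UG_inst + Year, data = abc, FUN = sum)

# Step 2: long to wide, giving one Freq.<year> column per year
wide <- reshape(abc_sum,
                v.names = "Freq",
                timevar = "Year",
                idvar = "UG_inst",
                direction = "wide")

# Institution/year pairs that never occur come out as NA; treat them as 0
wide[is.na(wide)] <- 0
wide
```

The xtabs option above sums duplicates on its own, so this extra step matters mainly for reshape().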

Related

How to do aggregate sum by time range in R?

I have a dataframe as below:
**df**
Cust_name time freq
Andrew 0 4
Dillain 1 2
Alma 2 3
Andrew 1 4
Kiko 2 1
Sarah 2 8
Sarah 0 3
I want to calculate the sum of frequency by the time range provided for each cust_name. Example: If I select time range 0 to 2 for Andrew, it will give me sum of freq: 4+4= 8. And for Sarah, it will give me 8+3=11. I have tried it in the following ways just to get the time range, but do not know how to do the rest, as I am very new to R:
df[(df$time>=0 & df$time<=2),]
You can do this with dplyr.
To make your code reproducible, you should add the creation of your dataframe in your post. Copy and pasting everything is time consuming.
library(dplyr)
df <- data.frame(
cust_name = c('Andrew', 'Dillain', 'Alma', 'Andrew', 'Kiko', 'Sarah', 'Sarah'),
time = c(0,1,2,1,2,2,0),
freq = c(4,2,3,4,1,8,3)
)
df %>%
filter(time >=0, time <=2) %>%
group_by(cust_name) %>%
summarise(sum_freq = sum(freq))
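If dplyr is not available, roughly the same result can be obtained in base R. The sketch below wraps the filter and the grouped sum in a helper; `sum_freq_in_range` and its `from`/`to` arguments are hypothetical names chosen for illustration.

```r
df <- data.frame(
  cust_name = c("Andrew", "Dillain", "Alma", "Andrew", "Kiko", "Sarah", "Sarah"),
  time = c(0, 1, 2, 1, 2, 2, 0),
  freq = c(4, 2, 3, 4, 1, 8, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum freq per customer for rows with from <= time <= to
sum_freq_in_range <- function(data, from, to) {
  in_range <- data[data$time >= from & data$time <= to, ]
  aggregate(freq ~ cust_name, data = in_range, FUN = sum)
}

res <- sum_freq_in_range(df, 0, 2)
res
```

With the range 0 to 2 every row is kept, so Andrew gets 4 + 4 = 8 and Sarah gets 8 + 3 = 11, matching the question.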

Using R to redistribute character values across multiple columns in a dataset

I am a new R user with a survey dataset that asks respondents (indicated by their IDs 100,101,102, and 103 in rows) to list common crimes and security problems in their neighborhoods. Types of incidents, crime1-crime3, are distributed across the columns. I know that the dataset is not well organized, but this is the structure of the output generated by Google form surveys for "select all that apply" questions.
I would like to write R code to reconfigure the dataset so that each type of crime/problem (for example, theft) has its own column. Then the character values could be replaced with 1. I have reorganized a small excerpt of the larger dataset by hand to show the end result I am looking for. Any suggestions would be greatly appreciated!
I initially tried to use gather() to collect all character values into one column and then redistribute into new columns but was unable to get it to work.
Original dataset:
respondentID crime1 crime1 crime3
100 vandalism other 0
101 other 0 0
102 drugs theft other
103 drugs theft vandalism
Trying to convert to:
respondentID drugs theft vandalism other
100 0 0 1 1
101 0 0 0 1
102 1 1 0 1
103 1 1 1 0
First, convert from wide to long:
df <- structure(list(respondentID = 100:103, crime1 = c("vandalism",
"other", "drugs", "drugs"), crime1.1 = c("other", "0", "theft",
"theft"), crime3 = c("0", "0", "other", "vandalism")),
class = "data.frame", row.names = c(NA,
-4L))
library(dplyr)
library(tidyr)
df_long <- df %>%
gather(key="crime_no", value="crime", -respondentID) %>%
select(-crime_no) %>%
filter(crime != "0")
Explanation: the first statement loads your data (you should use dput next time you ask a question). The second converts it to long format, with one row per respondent and reported crime. We do not need the "crime_no" column, as you are not interested in whether a crime was listed first, second or third. Finally, we drop the "0" entries (the zeros will be filled in automatically later).
Now, calculate the statistics:
df_stat <- df_long %>% group_by(respondentID, crime) %>% summarise(n=n())
Explanation: while in your example data each respondent reported each type of crime only once, I assume that this is not the general case, and sometimes you will see other numbers. We first group the data by respondent and crime, and next we count how many times each combination occurs.
And back to wide format:
df_wide <- df_stat %>% spread(key=crime, value=n, fill=0)
Explanation: we now convert to the "wide" format, where each crime has its own column. We use the parameter fill=0, so if there is missing data (i.e., a respondent did not report a given crime), we insert 0 rather than NA.
And here is your result:
respondentID drugs other theft vandalism
<int> <dbl> <dbl> <dbl> <dbl>
1 100 0 1 0 1
2 101 0 1 0 0
3 102 1 1 1 0
4 103 1 0 1 1
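For comparison, a similar table can be produced without dplyr or tidyr. This base R sketch (an alternative, not the method used above) stacks the crime columns by hand and lets table() do the counting:

```r
df <- data.frame(
  respondentID = 100:103,
  crime1   = c("vandalism", "other", "drugs", "drugs"),
  crime1.1 = c("other", "0", "theft", "theft"),
  crime3   = c("0", "0", "other", "vandalism"),
  stringsAsFactors = FALSE
)

# Stack the three crime columns; unlist() goes column by column,
# so the respondent IDs are repeated once per crime column
long <- data.frame(
  respondentID = rep(df$respondentID, times = ncol(df) - 1),
  crime = unlist(df[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the "0" placeholders, then count respondent/crime combinations
long <- long[long$crime != "0", ]
wide <- as.data.frame.matrix(table(long$respondentID, long$crime))
wide
```

Here the respondent IDs end up as row names rather than as a column; `wide$respondentID <- rownames(wide)` would add them back if needed.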
Next time you ask a question, please
use dput so we can easily load your example data
show some code: how did you try to tackle the problem on your own?

How can I group by one variable in terms of status of a different variable in a longitudinal situation in R?

I'm new to R, so please go easy on me... I have some longitudinal data that looks like
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing data. The end results would ideally be
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, because your request and the desired output seem inconsistent, but let's try. It looks like you need a kind of frequency table, which you can build with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, the Complete, and the other, so here a new column about it:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
`Number.complete` = result$Complete,
`Number.incomplete.missing` = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or if you prefer a dplyr chain:
library(dplyr)
data %>%
mutate(case = ifelse(Completion_status =='Complete','Complete', 'MorIn')) %>%
do( as.data.frame.matrix(table(.$Location,.$case))) %>%
mutate(Location = rownames(.)) %>%
select(3,1,2) %>%
`colnames<-`(c("Location","Number of complete ", "Number of incomplete or"))
Location Number of complete Number of incomplete or
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here is your data (next time try to put it in a usable form in the question)
data <- data.frame( ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
'Complete','Incomplete','Incomplete','Missing'))
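The tables above count rows, whereas the question asks for the number of unique cases. Assuming a "case" means one ID (an assumption, since the question's example did not survive), a base R sketch that counts each ID once could look like this; the labels "AllComplete" and "AnyIncompleteOrMissing" are made up for illustration:

```r
data <- data.frame(
  ID = c("A1", "A1", "A2", "A2", "B1", "C1", "C2", "D1", "D2", "E1"),
  Location = c("Paris", "Paris", "Paris", "Paris", "London", "Toronto",
               "Toronto", "Phoenix", "Phoenix", "Los Angeles"),
  Completion_status = c("Complete", "Complete", "Incomplete", "Complete",
                        "Incomplete", "Missing", "Complete", "Incomplete",
                        "Incomplete", "Missing"),
  stringsAsFactors = FALSE
)

# TRUE for an ID only if every one of its records is Complete
all_complete <- tapply(data$Completion_status == "Complete", data$ID, all)

# Location of each ID (assumes an ID never changes location)
loc <- tapply(data$Location, data$ID, function(x) x[1])

# Both vectors are ordered by ID, so they line up element by element
res <- table(loc, ifelse(all_complete, "AllComplete", "AnyIncompleteOrMissing"))
res
```

On this data Paris has one fully complete ID (A1) and one ID with an incomplete record (A2), instead of the 3 and 1 obtained when counting rows.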

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea about the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so a total of 540 rows.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
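So if I understand your problem correctly, you basically have two dataframes and you want to make sure that "price1" has the correct values for each car name in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# note: this only fills values that are already NA, so price1 must first
# contain a row (with NA Price and Rebate) for every missing week
# loop over the rows of price1
for (i in seq_len(nrow(price1))) {
# find the row of price2 for the same car name
k <- match(price1$Name[i], price2$Name)
# if the price is NA, replace it with the corresponding value in price2
if (is.na(price1$Price[i])) {
price1$Price[i] <- price2$AveragePrice[k]
}
# same for the rebate
if (is.na(price1$Rebate[i])) {
price1$Rebate[i] <- price2$AverageRebate[k]
}
}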
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <-
price2 %>%
anti_join(price1, by = "Name")
To expand your first table to cover all 54 weeks use complete() or you can even fudge it and right_join a table that you will purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
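The "build a table with every week and join against it" idea can be sketched in base R as follows. The data is a made-up miniature (two cars, five weeks) and `n_weeks` is a name chosen for illustration; with the real data it would be 54.

```r
# Miniature stand-ins for price1 and price2; values are invented
price1 <- data.frame(
  Name   = c("Car 1", "Car 1", "Car 1", "Car 2", "Car 2"),
  Week   = c(1L, 2L, 5L, 1L, 3L),
  Price  = c(20000, 20000, 20000, 18000, 18500),
  Rebate = c(500, 400, 400, 300, 350),
  stringsAsFactors = FALSE
)
price2 <- data.frame(
  Name          = c("Car 1", "Car 2"),
  AveragePrice  = c(20100, 18200),
  AverageRebate = c(430, 320),
  stringsAsFactors = FALSE
)
n_weeks <- 5  # 54 in the question

# Every Name/Week combination that should exist
grid <- expand.grid(Name = unique(price1$Name),
                    Week = seq_len(n_weeks),
                    stringsAsFactors = FALSE)

# Left joins: weeks absent from price1 get NA Price and Rebate
full <- merge(grid, price1, by = c("Name", "Week"), all.x = TRUE)
full <- merge(full, price2, by = "Name", all.x = TRUE)

# Fill the gaps from the per-car averages
gap <- is.na(full$Price)
full$Price[gap]  <- full$AveragePrice[gap]
full$Rebate[gap] <- full$AverageRebate[gap]

full <- full[order(full$Name, full$Week), c("Name", "Week", "Price", "Rebate")]
full
```

This assumes Price and Rebate are always missing together, which holds when whole rows are absent as in the question.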

Count by elements in a variable R

I am attempting to count elements within a variable (column) and group them by elements in another variable. The table below is the current data I am working with:
. Company.Name Sales.Team Product.Family
1 example1 Global Accounts FDS
2 example2 Americas RDS
3 Example3 WEMEA 2 Research
4 Example4 WEMEA 2 Research
5 Example5 CEE Research
6 Example6 CEE Research
What I am trying to do is aggregate the count of company names by product family. So it would look something like:
FDS RDS Research
Americas 0 1 0
CEE 0 0 2
Global Accounts 1 0 0
WEMEA 2 0 0 2
I have been messing around with the aggregate function, but this has not yielded the needed data. I am having trouble with determining how to have columns based on elements in a row.
Any help would be appreciated.
You can solve this with the base R table() function. Using an example table:
table(example_table$Sales.Team, example_table$Product.Family)
A basic run through for frequency tables can be found here at quick-R
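To make the table() call concrete, here is a small self-contained sketch. `example_table` is rebuilt from the six rows shown in the question, and as.data.frame.matrix() is used in case a data frame in the same wide layout is wanted.

```r
example_table <- data.frame(
  Company.Name   = paste0("Example", 1:6),
  Sales.Team     = c("Global Accounts", "Americas", "WEMEA 2", "WEMEA 2", "CEE", "CEE"),
  Product.Family = c("FDS", "RDS", "Research", "Research", "Research", "Research"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are sales teams, columns are product families
counts <- table(example_table$Sales.Team, example_table$Product.Family)

# Same layout as a data frame, with sales teams as row names
counts_df <- as.data.frame.matrix(counts)
counts_df
```

Unlike a hard-coded summary, this picks up any new product family automatically as an extra column.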
If you need your output to be a dataframe, this is really easy using dplyr.
library(dplyr)
my_df <- data.frame("Product.Family" = c("FDS", "RDS", rep("Research", 4)), "Company.Name" = paste0("Example", 1:6), "Sales.Team" = c("Global Accounts", "Americas", rep("WEMEA 2", 2), rep("CEE", 2)))
summary_df <- my_df %>%
group_by(Sales.Team) %>%
summarize(FDS = sum(Product.Family == "FDS"), RDS = sum(Product.Family == "RDS"), Research = sum(Product.Family == "Research"))
