I have a data frame with values for 4 different years. I need to find how these values change across the years, i.e. which city's value changes the most and which changes the least.
City Ratio1 Ratio2 Ratio3 Ratio4
A 1.0177722 1.0173251 1.0133026 1.0140027
B 1.0132619 1.0122653 1.0128473 1.0111068
C 1.0689484 1.0640355 1.0625305 1.0544790
..... other 1000 entries
I have tried to do it by taking differences, but with no luck. The question is which city's ratio changed the most between Ratio1 and Ratio4 and which changed the least.
I have also tried using the mutate function to calculate the variance, but it throws a warning:
DF<- DF%>% mutate(vari = var(Ratio1:Ratio4,na.rm = T))
Warning messages:
1: In POP_2013_ratio:POP_2016_ratio :
numerical expression has 439 elements: only the first used
2: In POP_2013_ratio:POP_2016_ratio :
numerical expression has 439 elements: only the first used
R's data.table package has a pretty neat way to create new columns based on existing ones:
library(data.table)
dt <- data.table(City = c("A", "B", "C"),
Ratio1 = c(1.0177722, 1.0132619, 1.0689484),
Ratio2 = c(1.0173251, 1.0122653, 1.0640355),
Ratio3 = c(1.0133026, 1.0128473, 1.0625305),
Ratio4 = c(1.0140027,1.0111068, 1.0544790))
>dt
City Ratio1 Ratio2 Ratio3 Ratio4
1: A 1.017772 1.017325 1.013303 1.014003
2: B 1.013262 1.012265 1.012847 1.011107
3: C 1.068948 1.064035 1.062531 1.054479
You can play around with some functions and then see what suits you best:
dt[, diff := Ratio4-Ratio1
][, abs_diff := abs(Ratio4-Ratio1)
][, range:= max(c(Ratio1, Ratio2, Ratio3, Ratio4))- min(c(Ratio1, Ratio2, Ratio3, Ratio4)), by = City
][,variance:=var(c(Ratio1, Ratio2, Ratio3, Ratio4)), by = City]
>dt
City Ratio1 Ratio2 Ratio3 Ratio4 diff abs_diff range variance
1: A 1.017772 1.017325 1.013303 1.014003 -0.0037695 0.0037695 0.0044696 5.174612e-06
2: B 1.013262 1.012265 1.012847 1.011107 -0.0021551 0.0021551 0.0021551 8.766456e-07
3: C 1.068948 1.064035 1.062531 1.054479 -0.0144694 0.0144694 0.0144694 3.609233e-05
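Once you have decided on the criterion to use (let's say, variance), you can select the top City, i.e. the one that changed most, as follows (ordering by variance instead of -variance gives the one that changed least):
>dt[order(-variance)][1]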
City Ratio1 Ratio2 Ratio3 Ratio4 diff abs_diff range variance
1: C 1.068948 1.064035 1.062531 1.054479 -0.0144694 0.0144694 0.0144694 3.609233e-05
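As for the dplyr attempt in the question: inside var(), Ratio1:Ratio4 is treated as the sequence operator applied to two whole columns, which is what the warning is complaining about. A minimal sketch of a row-wise variance, assuming dplyr 1.0.0 or later (for rowwise() with c_across()) and using only the three sample rows shown above:

```r
library(dplyr)

DF <- data.frame(
  City   = c("A", "B", "C"),
  Ratio1 = c(1.0177722, 1.0132619, 1.0689484),
  Ratio2 = c(1.0173251, 1.0122653, 1.0640355),
  Ratio3 = c(1.0133026, 1.0128473, 1.0625305),
  Ratio4 = c(1.0140027, 1.0111068, 1.0544790)
)

# Compute the variance across the four ratio columns, one row at a time
DF <- DF %>%
  rowwise() %>%
  mutate(vari = var(c_across(Ratio1:Ratio4), na.rm = TRUE)) %>%
  ungroup()

# City with the largest and the smallest variance
most_changed  <- DF$City[which.max(DF$vari)]
least_changed <- DF$City[which.min(DF$vari)]
```

On this sample the variances agree with the data.table column above, so either route can be used for the ranking.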
(1) I have a data frame called COPY that looks like this
COPY <- data.frame (year = c(values_here),
Ceremony = c(values_here),
Award = c(values_here),
Winner = c(values_here),
Name = c(values_here),
Film = c(values_here)
)
(2) Some of the entries in the Name and Film columns are mixed up for some rows.
(3) I created a vector of all the names in the wrong place using this code.
COPY$Film[COPY['Award']=='Director' & COPY['Year']>1930]->name
For the entries where Award is 'Director' and the year is greater than 1930, the Name and Film columns are mixed up.
(4) Now I would like to replace COPY$Name based on the conditions stated with my new name object. I tried this code.
replace(COPY$Name,COPY$Award =='Director' && COPY$Year>1930,name)
So basically I'm trying to swap the Name and Film columns where the Award column is 'Director' and the Year column is greater than 1930.
Lacking data, try this:
COPY <- data.frame (year = 2000:2002,
Ceremony = NA,
Award = c("A", "Director", "B"),
Winner = NA,
Name = c("A","B","C"),
Film = c("1","2","3")
)
# Rows to swap: Director awards after 1930, as described in the question
swap <- COPY$Award == "Director" & COPY$year > 1930
COPY <- transform(COPY, Name = ifelse(swap, Film, Name), Film = ifelse(swap, Name, Film))
COPY
# year Ceremony Award Winner Name Film
# 1 2000 NA A NA A 1
# 2 2001 NA Director NA 2 B
# 3 2002 NA B NA C 3
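A base R alternative, as a minimal sketch: the replace() attempt in the question fails because && does not compare element by element (and is an error for longer vectors in R 4.3.0 and later), whereas & does. With a vectorised condition the two columns can be swapped in a single indexed assignment. The toy rows below are made up for illustration, and the column is named year as in the sample above.

```r
# Made-up data: one Director row before 1930 that must stay untouched
COPY <- data.frame(
  year  = c(1929, 2001, 2002),
  Award = c("Director", "Director", "Actor"),
  Name  = c("A", "B", "C"),
  Film  = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Element-wise condition: note & rather than &&
swap <- COPY$Award == "Director" & COPY$year > 1930

# The right-hand side is read before anything is written, so this swaps
# the two columns for the selected rows (values are matched by position)
COPY[swap, c("Name", "Film")] <- COPY[swap, c("Film", "Name")]
COPY
```

Only the 2001 row is changed here; the 1929 Director row keeps its original Name and Film.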
set.seed(1)
data <- data.frame(SCHOOL = rep(1:10, each = 1000), GRADE = sample(7:12, size = 10000, replace = TRUE), SCORE = sample(1:100, size = 10000, replace = TRUE))
I have 'data' that contains information about student test scores. I wish to count the number of rows for each GRADE within each SCHOOL, and then take the smallest of those counts for each GRADE across all SCHOOLs. In steps:
(1) For each SCHOOL, count the number of rows for each GRADE.
(2) Then, for each GRADE, find the smallest count across all SCHOOLs.
(3) Finally, take a random sample within each SCHOOL and GRADE, with the sample size given by the smallest count found in step 2.
So, in a basic example with two SCHOOLs and two GRADEs (7 and 8):
SCHOOL 1 has 2 SCOREs for GRADE 7 and 3 SCOREs for GRADE 8.
SCHOOL 2 has 1 SCORE for GRADE 7 and 4 SCOREs for GRADE 8.
So the new data contains one SCORE for GRADE 7 from each of SCHOOL 1 and SCHOOL 2, and three SCOREs for GRADE 8 from each of SCHOOL 1 and SCHOOL 2, and the SCOREs that are picked are randomly sampled.
My attempt:
data[, .SD[sample(x = .N, size = min(sum(GRADE), .N))], by = .(SCHOOL, GRADE)]
This follows your description of how to do it step-by-step.
library(data.table)
setDT(data)
data[, N := .N, .(SCHOOL, GRADE)]
data[, N := min(N), GRADE]
data[, .(SCORE = sample(SCORE, N)), .(SCHOOL, GRADE, N)][, -'N']
If you have multiple SCORE-like columns and you want to keep the same rows from each, then you can use .SD as in your attempt:
data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']
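To check the logic against the two-school example from the question, here is a small sketch on made-up scores (the table name toy and the SCORE values are assumptions for illustration). It uses the .SD form, which samples row indices; that also sidesteps the behaviour of sample(x, n) when x is a single number, where base R samples from 1:x instead of returning x.

```r
library(data.table)
set.seed(1)

# SCHOOL 1: 2 rows in GRADE 7, 3 in GRADE 8; SCHOOL 2: 1 row in GRADE 7, 4 in GRADE 8
toy <- data.table(
  SCHOOL = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  GRADE  = c(7, 7, 8, 8, 8, 7, 8, 8, 8, 8),
  SCORE  = c(55, 60, 70, 75, 80, 65, 50, 58, 62, 90)
)

toy[, N := .N, by = .(SCHOOL, GRADE)]  # rows per SCHOOL and GRADE
toy[, N := min(N), by = GRADE]         # smallest count per GRADE across SCHOOLs

# Sample N row indices within each SCHOOL/GRADE group
balanced <- toy[, .SD[sample(.N, N)], by = .(SCHOOL, GRADE, N)]
balanced[, N := NULL]

counts <- balanced[, .(n = .N), by = .(SCHOOL, GRADE)]
counts
```

Each school ends up with one GRADE 7 row and three GRADE 8 rows, matching the description in the question.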
I am a newbie in programming with R, and this is my first question ever here on Stack Overflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code in order to get the counts by each category for the variable Morality in each city:
mycounts<-melt(mydf,
id.vars = c("City"),
measure.vars = c("Morality"))%>%
dcast(City~variable+value,
value.var="value",fill=0,fun.aggregate=length)
The code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. in order to get the sum of the Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
Do you mean something like this, i.e. the sum of Books for each Morality category within each city?
mydf %>%
melt(
id.vars = c("City", "Books"),  # Books has to be kept as an id column so it can be summed below
measure.vars = c("Morality")
) %>%
dcast(
City ~ variable + value,
value.var = "Books",
fill = 0,
fun.aggregate = sum
)
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
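If the goal is the table sketched in the question (counts per Morality plus one Books total per city), one possible approach is to build the two pieces separately and merge them on City. This is only a sketch using data.table; the intermediate names counts and books are made up for illustration.

```r
library(data.table)

mydf <- data.table(
  Person_ID = 1:10,
  City      = rep(c("NiceCity", "UglyCity"), times = 5),
  Morality  = rep(c("Bad guy", "Good guy"), each = 5),
  Books     = seq(0, 27, by = 3)
)

# Counts of each Morality category per city
counts <- dcast(mydf, City ~ Morality, fun.aggregate = length, value.var = "Person_ID")

# Total number of books per city
books <- mydf[, .(Books = sum(Books)), by = City]

# One row per city: the two count columns plus the Books total
result <- merge(counts, books, by = "City")
result
```

The counts match the table in the question, and the Books column holds the city totals (the sum of the two Morality columns in the dcast() output above).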
I am using RStudio for data analysis in R. I currently have a dataframe which is in a long format. I want to convert it into the wide format.
An extract of the dataframe (df1) is shown below. I have converted the first column into a factor.
Extract:
df1 <- read.csv("test1.csv", stringsAsFactors = FALSE, header = TRUE)
df1$Respondent <- factor(df1$Respondent)
df1
Respondent Question CS Imp LOS Type Hotel
1 1 Q1 Fully Applied High 12 SML ABC
2 1 Q2 Optimized Critical 12 SML ABC
I want a new dataframe (say, df2) to look like this:
Respondent Q1CS Q1Imp Q2CS Q2Imp LOS Type Hotel
1 Fully Applied High Optimized Critical 12 SML ABC
How can I do this in R?
Additional notes: I have tried looking at the tidyr package and its spread() function but I am having a hard time implementing it to this specific problem.
This can be achieved with a gather-unite-spread approach (using the sample data below):
library(dplyr)
library(tidyr)
df %>%
group_by(Respondent) %>%
gather(k, v, CS, Imp) %>%
unite(col, Question, k, sep = "") %>%
spread(col, v)
# Respondent LOS Type Hotel Q1CS Q1Imp Q2CS Q2Imp
#1 1 12 SML ABC Fully Applied High Optimized Critical
Sample data
df <- read.table(text =
" Respondent Question CS Imp LOS Type Hotel
1 1 Q1 'Fully Applied' High 12 SML ABC
2 1 Q2 'Optimized' Critical 12 SML ABC", header = T)
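In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a sketch of the same reshaping with a single pivot_wider() call, assuming the other columns (LOS, Type, Hotel) are constant within each Respondent; the names_glue pattern is what puts the question label before the variable name:

```r
library(tidyr)

df <- data.frame(
  Respondent = c(1, 1),
  Question   = c("Q1", "Q2"),
  CS         = c("Fully Applied", "Optimized"),
  Imp        = c("High", "Critical"),
  LOS        = 12,
  Type       = "SML",
  Hotel      = "ABC"
)

# One output column per Question/variable pair, named e.g. Q1CS, Q2Imp
wide <- pivot_wider(
  df,
  names_from  = Question,
  values_from = c(CS, Imp),
  names_glue  = "{Question}{.value}"
)
wide
```

All columns not named in names_from or values_from are treated as identifiers, so LOS, Type and Hotel are carried along without an extra join.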
In data.table, this can be done in a one-liner....
dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")[DT, `:=`(LOS = i.LOS, Type = i.Type, Hotel = i.Hotel), on = "Respondent"][]
Respondent CSQ1 CSQ2 ImpQ1 ImpQ2 LOS Type Hotel
1: 1 Fully Applied Optimized High Critical 12 SML ABC
Explained step by step:
Create sample data:
DT <- fread("Respondent Question CS Imp LOS Type Hotel
1 Q1 'Fully Applied' High 12 SML ABC
1 Q2 'Optimized' Critical 12 SML ABC", quote = '\'')
Cast part of the data.table to the desired format by question.
The column names might not be what you want; you can always change them using setnames().
dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")
# Respondent CSQ1 CSQ2 ImpQ1 ImpQ2
# 1: 1 Fully Applied Optimized High Critical
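Then store the dcast result and join by reference on the original DT to get the rest of the columns you need:
result.from.dcast <- dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")
result.from.dcast[DT, `:=`(LOS = i.LOS, Type = i.Type, Hotel = i.Hotel), on = "Respondent"]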
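A possible variation, sketched under the assumption that LOS, Type and Hotel never vary within a Respondent: put them on the left-hand side of the dcast() formula so no join is needed, then use setnames() to move the question label to the front (Q1CS rather than CSQ1). The regular expression is an illustrative choice that fits these particular column names.

```r
library(data.table)

DT <- data.table(
  Respondent = c(1, 1),
  Question   = c("Q1", "Q2"),
  CS         = c("Fully Applied", "Optimized"),
  Imp        = c("High", "Critical"),
  LOS        = 12,
  Type       = "SML",
  Hotel      = "ABC"
)

# Keep the constant columns as identifiers instead of joining them back
wide <- dcast(DT, Respondent + LOS + Type + Hotel ~ Question,
              value.var = c("CS", "Imp"), sep = "")

# Rename CSQ1 -> Q1CS, ImpQ2 -> Q2Imp, and so on
old <- grep("^(CS|Imp)Q", names(wide), value = TRUE)
new <- sub("^(CS|Imp)(Q.*)$", "\\2\\1", old)
setnames(wide, old, new)
wide
```

If the identifier columns can differ between rows of the same Respondent, this produces more than one row per Respondent, in which case the join shown above is the safer route.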
I am trying to figure out how to generate a new column in R that indicates whether a politician "i" remains in the same party or defects in a given legislature "l". The politicians and parties are identified by numeric indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
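## generating ids for politician and party.affiliation
library(dplyr)  # for arrange()
d$id_pers <- paste(d$names, sep = "")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
# factor() first: since R 4.0.0 data.frame() keeps strings as character, and as.numeric() on those gives NA
d$party.affiliation1 <- as.numeric(factor(d$party.affiliation))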
The expected outcome is the following: if a politician (identified by the column "id_pers") has changed their value in the column "party.affiliation1", a value of 1 is assigned in a new column called "switch", otherwise 0. The same procedure should be applied to every politician in the dataset, so the expected outcome should look like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. However, Anrita changes her party affiliation in the 6th legislature, where she was a candidate for party '2'. Therefore, the new column "switch" should contain a '1' to reflect Anrita's change of party affiliation, and '0' for the first 5 legislatures, in which she did not change her party affiliation.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry for each politician will be NA, as we have no information on whether their previous party affiliation, if any, was different.
Edit: To mark the first value of each politician separately, we can use the default= parameter of lag() together with a nested ifelse().
df <- d %>%
  group_by(id_pers) %>%
  mutate(switch = ifelse(
    party.affiliation1 - lag(party.affiliation1, default = -99) > 90,
    99,  # first record of each politician: lag() fell back to the default
    ifelse(party.affiliation1 - lag(party.affiliation1) != 0, 1, 0)
  ))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. The result looks like this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
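The shift()-based comparison can also be checked on a tiny table. This is only a sketch: the table name votes, the column name party and the rows are made up for illustration, and the data is ordered by politician and legislature before comparing.

```r
library(data.table)

votes <- data.table(
  id_pers     = c(1, 1, 1, 2, 2, 2),
  legislature = c(1, 2, 3, 1, 2, 3),
  party       = c("Winner", "Winner", "New party", "Loser", "Loser", "Loser")
)

# Make sure each politician's rows are in chronological order
setorder(votes, id_pers, legislature)

# Compare each affiliation with the previous one for the same politician;
# the first row of each politician has no predecessor, so it stays NA
votes[, switch := as.integer(party != shift(party)), by = id_pers]
votes
```

Politician 1 gets NA, 0, 1 (a switch in the third legislature) and politician 2 gets NA, 0, 0.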
EDITED
In order to identify each politician's first entry and assign the value 99 to it, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]