Getting a few tables together - r

I have three tables read into R as dataframes.
Table1:
Student ID School_Name Campus Area
4356791 BCCS Northwest Springdale
03127. BZS South Vernon
12437. BCCS. South Vernon
Table 2:
ProctorID. Date. Score. Student ID Form#
0211 10/05/16 75.57 55612 25432178
0211 10/17/16 83.04 55612 47135671
5134 10/17/16 63.28 02613 2371245
Table 3:
ProctorID First. Last. Campus Area
O211. Simone Lewis. Northwest Springdale
5134. Mona. Yashamito Northwest Springdale
0712. Steven. Lewis. South Vernon
I want to combine the data frames and create a table with the scores next to each other for each area, by school name. I want an output like the following:
School_Name Form# Northwest Springdale Southvernon
BCCS. 2543127. 83.04. 63.25
BCCS. 35674. 75.14. *
BZS. 5321567. 65.2. 62.3
A particular form for a particular school may not have a score for a certain area. Any ideas? I have been playing with the sqldf package. Also, is it possible to do this in R without using any SQL?

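To cast, something like this (the formula needs a bare column name rather than a quoted string, so a name containing a space is renamed first):
library(reshape2)
names(df)[names(df) == "Campus Area"] <- "Campus_Area"
casted_df <- dcast(df, ... ~ Campus_Area, value.var = "Score.")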
An example that seems to work for me:
df1 <- data.frame("StudentID" = 1:3, "SchoolName" = c("School1", "School2", "School3"), "Area" = c("Area1", "Area2", "Area3"))
df2 <- data.frame("StudentID" = 1:3, "Score" = 100:102, "Proctor" = 4:6)
df3 <- data.frame("Proctor" = 4:6, "Area" = c("Area1", "Area2", "Area3"), "Name" = c("John", "Jane", "Jim"))
combined <- merge(df1, df2, by.x = "StudentID")
combined2 <- merge(combined, df3, by.x = "Proctor", by.y="Proctor")
library(reshape2)
final <- dcast(combined2, ... ~ Area.x, value.var="Score")
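For the "without using any sql" part of the question, the same join-then-widen idea can be sketched with base R only. The three small tables below are hypothetical stand-ins: the column names `Form` and `CampusArea` and the area assigned to each proctor are assumptions, not the asker's real data. IDs are kept as character so that leading zeros survive.

```r
# Hypothetical miniature versions of the three tables (values are illustrative).
students <- data.frame(StudentID   = c("55612", "02613"),
                       School_Name = c("BCCS", "BZS"),
                       stringsAsFactors = FALSE)
scores <- data.frame(ProctorID = c("0211", "0211", "5134"),
                     Score     = c(75.57, 83.04, 63.28),
                     StudentID = c("55612", "55612", "02613"),
                     Form      = c("25432178", "47135671", "2371245"),
                     stringsAsFactors = FALSE)
proctors <- data.frame(ProctorID  = c("0211", "5134"),
                       CampusArea = c("Northwest Springdale", "South Vernon"),
                       stringsAsFactors = FALSE)

# Attach the school to each score via the student, then the area via the proctor.
joined <- merge(merge(scores, students, by = "StudentID"),
                proctors, by = "ProctorID")

# One row per school and form, one Score column per campus area.
wide <- reshape(joined[, c("School_Name", "Form", "CampusArea", "Score")],
                idvar     = c("School_Name", "Form"),
                timevar   = "CampusArea",
                direction = "wide")
wide
```

A form that has no score in a given area ends up with NA in that area's column.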

Related

Assigning Value to New Variable Based on Specific Values in Another Variable in R

I have a data.frame that contains state names and I would like to create a new variable called "region" in which a value is assigned based on the state that is found under the "state" variable.
For example, if the state variable has "Alabama" or "Georgia", I would like to have "Region" assigned as "South". If state is "Washington" or "California", I would like it assigned to "West". I have to do this for each of the 48 contiguous U.S. states, and I'm having difficulty figuring out the best way to do this. Any help in this (I'm sure simple) procedure would be great. What I am looking for is something like this in the end:
State Region
Wyoming West
Michigan Midwest
Alabama South
Georgia South
California West
Texas Central
And to be clear, I don't have the regions in a separate file; I have to create this as a new variable and create the region names myself. I'm just looking for a way that the code can go through all 3000 lines that I have and automatically assign the region name once I tell it how to do so.
Rather than type the region for every state, you can use the built-in "state.name" and "state.region" variables from the 'datasets' package (like Jon Spring suggests in his comment), e.g.
library(tidyverse)
library(datasets)
state_lookup_table <- data.frame(name = state.name,
region = state.region)
my_df <- data.frame(place = c("Washington", "California"),
value = c(1000, 2000))
my_df
#> place value
#> 1 Washington 1000
#> 2 California 2000
my_df %>%
left_join(state_lookup_table, by = c("place" = "name"))
#> place value region
#> 1 Washington 1000 West
#> 2 California 2000 West
Created on 2022-09-02 by the reprex package (v2.0.1)
I would go this way:
df <- data.frame(name = c("john", "will", "thomas", "Ali"),
state = c("California", "Alabama", "Washington", "Georgia"))
region_df <- data.frame(state= c("Alabama", "Georgia", "Washington"),
region = c("south", "south", "west"))
merged.df <- merge(df, region_df, all.x = TRUE, by = "state")
I think you need a reference to do so. For your specific question, a dict (in R, a named vector) would be the best solution.
ref_ge <- c()
ref_ge["Georgia"] <- "South"
ref_ge["Alabama"] <- "South"
ref_ge["California"] <- "West"
ref_ge["Georgia"]
#Or, if you could read the state->region information from an excel to a dataframe
df=data.frame(state=c("Georgia","Alabama","California"),region=c("South","South","West"))
ref2 <- df$region
names(ref2) <- df$state
ref2["Georgia"]
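A named vector like this can be applied to a whole column in one step by indexing it with the state names. The sketch below uses a hypothetical `region_lookup` vector and `my_states` data frame; states missing from the lookup come back as NA.

```r
# Hypothetical lookup: names are states, values are the regions you define.
region_lookup <- c(Alabama = "South", Georgia = "South",
                   Washington = "West", California = "West")

my_states <- data.frame(state = c("California", "Alabama", "Oregon"),
                        stringsAsFactors = FALSE)

# Index the lookup by state name; unname() drops the names carried along.
my_states$region <- unname(region_lookup[my_states$state])
my_states
```

Because the indexing is vectorised, the same line works unchanged on 3000 rows.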

Create two new columns by mapping multiple columns

How can I match columns in R and extract a value? As an example: I want to match dataframe_one with dataframe_two on the basis of the Name and City columns and then return the output with two additional columns, temp and ID. If a row matches, temp should be TRUE and the ID should be returned too.
My input is:
dataframe_one
Name City
Sarah ON
David BC
John KN
Diana AN
Judy ON
dataframe_two
Name City ID
Dave ON 1092
Diana AN 2314
Judy ON 1290
Ari KN 1450
Shanu MN 1983
I want the output to be
Name City temp ID
Sarah ON FALSE NA
David BC TRUE 1450
John KN TRUE 1983
Diana AN FALSE NA
Judy ON FALSE NA
One thing that makes answering questions of this type easier is if you at least put the data frames in R, like so:
df1 <- data.frame(stringsAsFactors=FALSE,
Name = c("Sarah", "David", "John", "Diana", "Judy"),
City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors=FALSE,
Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
City = c("ON", "AN", "ON", "KN", "MN"),
ID = c(1092, 2314, 1290, 1450, 1983)
)
Then search existing Stack Overflow questions that have answered similar questions (e.g. How to join (merge) data frames (inner, outer, left, right)).
Given that neither of your original dfs contain the column "Temp" you would need to create it in the joined (merged) data frame.
We'll be able to help a lot more if you at least make a start and then the community will help you troubleshoot.
That being said, I can't for the life of me figure out how you would generate your output df from the inputs.
Using biomiha's code to generate df1 and df2:
df1 <- data.frame(stringsAsFactors=FALSE,
Name = c("Sarah", "David", "John", "Diana", "Judy"),
City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors=FALSE,
Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
City = c("ON", "AN", "ON", "KN", "MN"),
ID = c(1092, 2314, 1290, 1450, 1983)
)
you may try:
library(dplyr)
df1 %>%
left_join(df2, by = c("Name" = "Name", "City" = "City")) %>%
mutate(temp = !is.na(ID))
gives the output:
Name City ID temp
1 Sarah ON NA FALSE
2 David BC NA FALSE
3 John KN NA FALSE
4 Diana AN 2314 TRUE
5 Judy ON 1290 TRUE
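The same result can be sketched in base R, assuming the df1 and df2 definitions above: merge() with all.x = TRUE keeps every row of df1, and unmatched rows get NA for ID. Note that merge() sorts the result by the key columns, so the row order differs from the dplyr output.

```r
df1 <- data.frame(Name = c("Sarah", "David", "John", "Diana", "Judy"),
                  City = c("ON", "BC", "KN", "AN", "ON"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
                  City = c("ON", "AN", "ON", "KN", "MN"),
                  ID   = c(1092, 2314, 1290, 1450, 1983),
                  stringsAsFactors = FALSE)

# Left join on both key columns; rows of df1 without a match keep ID = NA.
merged <- merge(df1, df2, by = c("Name", "City"), all.x = TRUE)

# temp is TRUE exactly where a matching ID was found.
merged$temp <- !is.na(merged$ID)
merged
```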

R observation strs split - multiple value in columns

I have a dataframe in R concerning houses. This is a small sample:
Address Type Rent
Glasgow;Scotland House 1500
High Street;Edinburgh;Scotland Apartment 1000
Dundee;Scotland Apartment 800
South Street;Dundee;Scotland House 900
I would like to just pull out the last two instances of the Address column into a City and County column in my dataframe.
I have used mutate and strsplit to split this column by:
data<-mutate(dataframe, split_add = strsplit(dataframe$Address, ";"))
I now have a new column in my dataframe which resembles the following:
split_add
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
How do I extract the last 2 instances of each of these vector observations into columns "City" and "County"?
I attempted:
data<-mutate(data, city=split_add[-2])
thinking it would take the second instance from the end of the vectors- but this did not work.
Using tidyr::separate() with the fill = "left" option is probably your best bet:
dataframe <- read.table(header = T, stringsAsFactors = F, text = "
Address Type Rent
Glasgow;Scotland House 1500
'High Street;Edinburgh;Scotland' Apartment 1000
Dundee;Scotland Apartment 800
'South Street;Dundee;Scotland' House 900
")
library(tidyr)
separate(dataframe, Address, into = c("Street", "City", "County"),
sep = ";", fill = "left")
# Street City County Type Rent
# 1 <NA> Glasgow Scotland House 1500
# 2 High Street Edinburgh Scotland Apartment 1000
# 3 <NA> Dundee Scotland Apartment 800
# 4 South Street Dundee Scotland House 900
I am thinking of another way of dealing with this problem.
1. Create a data frame with the split_add column data
c("Glasgow","Scotland")
c("High Street","Edinburgh","Scotland")
c("Dundee","Scotland")
c("South Street","Dundee","Scotland")
test_data <- data.frame(address = c("Glasgow, Scotland",
"High Street, Edinburgh, Scotland",
"Dundee, Scotland",
"South Street, Dundee, Scotland"), stringsAsFactors = FALSE)
2. Use separate() from tidyr to split the column (two-part addresses leave c3 as NA)
library(tidyr)
new_test <- test_data %>% separate(address, c("c1","c2","c3"), sep = ", ", fill = "right")
3. Use dplyr and ifelse() to keep only the last two parts
library(dplyr)
new_test %>%
mutate(city = ifelse(is.na(c3),c1,c2),county = ifelse(is.na(c3),c2,c3)) %>%
select(city,county)
The final data contains only the city and county columns.
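Assuming that you're using dplyr, take the tail of each row's vector with sapply() (calling tail() directly on split_add would take the last rows of the whole column instead):
data <- mutate(dataframe, split_add = strsplit(Address, ';'), City = sapply(split_add, function(x) tail(x, 2)[1]), County = sapply(split_add, function(x) tail(x, 1)))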
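As for why the asker's `split_add[-2]` attempt fails: it indexes the list column as a whole (dropping its second element) rather than indexing inside each row's vector. A minimal base R sketch of per-row indexing follows; the `houses` data frame is a hypothetical copy of the sample data, and every address is assumed to have at least two parts.

```r
houses <- data.frame(
  Address = c("Glasgow;Scotland", "High Street;Edinburgh;Scotland",
              "Dundee;Scotland", "South Street;Dundee;Scotland"),
  Type = c("House", "Apartment", "Apartment", "House"),
  Rent = c(1500, 1000, 800, 900),
  stringsAsFactors = FALSE
)

# One character vector of address parts per row.
parts <- strsplit(houses$Address, ";", fixed = TRUE)

# Index within each row's vector: second-to-last part, then last part.
houses$City   <- vapply(parts, function(p) p[length(p) - 1], character(1))
houses$County <- vapply(parts, function(p) p[length(p)], character(1))
houses
```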

Frequency tables by groups with weighted data in R

I wish to calculate two kinds of frequency tables by groups with weighted data.
You can generate reproducible data with the following code:
Data <- data.frame(
country = sample(c("France", "USA", "UK"), 100, replace = TRUE),
migrant = sample(c("Native", "Foreign-born"), 100, replace = TRUE),
gender = sample (c("men", "women"), 100, replace = TRUE),
wgt = sample(100),
year = sample(2006:2007)
)
Firstly, I am trying to calculate a frequency table of migrant status (Native vs. Foreign-born) by country and year. I wrote the following code using the packages questionr and plyr:
db2006 <- subset (Data, year == 2006)
db2007 <- subset (Data, year == 2007)
result2006 <- as.data.frame(cprop(wtd.table(db2006$migrant, db2006$country, weights=db2006$wgt),total=FALSE))
result2007 <- as.data.frame(cprop(wtd.table(db2007$migrant, db2007$country, weights=db2007$wgt),total=FALSE))
result2006<-rename (result2006, c(Freq = "y2006"))
result2007<-rename (result2007, c(Freq = "y2007"))
result <- merge(result2006, result2007, by = c("Var1","Var2"))
In my real database, I have 10 years, so it takes time to apply this code to all the years. Does anyone know a faster way to do it?
I also wish to calculate the share of women and men among migrant status by country and year. I am looking for something like:
Var1 Var2 Var3 y2006 y2007
Foreign born France men 52 55
Foreign born France women 48 45
Native France men 51 52
Native France women 49 48
Foreign born UK men 60 65
Foreign born UK women 40 35
Native UK men 48 50
Native UK women 52 50
Does anyone have an idea of how I can get these results?
You could do this by: making a function with the code you've already written; using lapply to iterate that function over all years in your data; then using Reduce and merge to collapse the resulting list into one data frame. Like this:
library(dplyr)      # filter() and %>%
library(questionr)  # wtd.table() and cprop()
# let's make your code into a function called 'tallyho'
tallyho <- function(yr, data) {
DF <- filter(data, year == yr)
result <- with(DF, as.data.frame(cprop(wtd.table(migrant, country, weights = wgt), total = FALSE)))
# rename the last column by year (use the function argument yr, not the column year)
names(result)[length(names(result))] <- sprintf("y%s", yr)
return(result)
}
# now iterate that function over all years in your original data set, then
# use Reduce and merge to collapse the resulting list into a data frame
NewData <- lapply(sort(unique(Data$year)), function(x) tallyho(x, Data)) %>%
Reduce(function(...) merge(..., all = TRUE), .)
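The second part of the question (the share of men and women within each migrant status, country and year) can be sketched without a loop. This is a base R alternative rather than the questionr approach used above: xtabs() sums the weights, and prop.table() normalises over gender. The seed and the `share` column name are assumptions made for illustration.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible
Data <- data.frame(
  country = sample(c("France", "USA", "UK"), 100, replace = TRUE),
  migrant = sample(c("Native", "Foreign-born"), 100, replace = TRUE),
  gender  = sample(c("men", "women"), 100, replace = TRUE),
  wgt     = sample(100),
  year    = sample(2006:2007, 100, replace = TRUE)
)

# Weighted counts for every migrant x country x gender x year cell.
counts <- xtabs(wgt ~ migrant + country + gender + year, data = Data)

# Percentages within each migrant/country/year group (dimensions 1, 2 and 4),
# so that the gender shares of a group add up to 100.
shares <- 100 * prop.table(counts, margin = c(1, 2, 4))

# Long format, then one column per year.
long <- as.data.frame(as.table(shares), responseName = "share")
wide <- reshape(long, idvar = c("migrant", "country", "gender"),
                timevar = "year", direction = "wide")
wide
```

A group with no observations in a given year would show NaN, since there is nothing to divide by.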

merge tables in R on specified fields

I have 2 csv files I want to merge on a linked key:
results.csv column headings:
schoolID, schoolName, Easting, Northing
123933, Mark College, 338371, 147812
139335, Hemsworth Arts and Community Academy, 442859, 413420
107563, Sowerby Bridge High School, 406122, 424146
137706, Willenhall E-ACT Academy, 398288, 300042
schools.csv column headings:
URN, LA (code), LA (name), EstablishmentNumber, EstablishmentName
123933, 201, City of London, 3614, Mark College
100001, 202, Camden, 6005, City of London School for Girls
139335, 201, City of London, 6006, Hemsworth Arts and Community Academy
100003, 201, City of London, 6007, City of London School
URN == schoolID and I want a final file showing data under column headings:
schoolID, schoolName, Easting, Northing, LA (name)
I've tried the following code:
res_data <- read.csv("C:/results.csv",
head=TRUE,sep=",")
school_data <- read.csv("C:/schools.csv",
head=TRUE,sep=",")
merge_data <- merge(x = res_data , y = school_data[c(1,3),], by.x = "schoolID", by.y = "URN" )
head(merge_data, 3)
But the result is just merging all the headings and not the data:
schoolID, schoolName, Easting, Northing, URN, LA (code), LA (name), EstablishmentNumber, EstablishmentName
Tested with supplied test data
merge_data <- merge(x = res_data , y = school_data[,c(1,3)], by.x = "schoolID", by.y = "URN" )
(TWO changes!)
I think you've selected the third row instead of the third column of school_data: the comma belongs before c(1,3). You also need to include the merge column (column 1, URN).
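The difference between the two index forms can be checked on a small inline copy of the data. In this sketch the column names `LA.code` and `LA.name` are simplified assumptions (read.csv would generate slightly different names from the original headings), and only a few rows are reproduced.

```r
res_data <- data.frame(
  schoolID   = c(123933, 139335, 107563),
  schoolName = c("Mark College", "Hemsworth Arts and Community Academy",
                 "Sowerby Bridge High School"),
  Easting    = c(338371, 442859, 406122),
  Northing   = c(147812, 413420, 424146)
)
school_data <- data.frame(
  URN     = c(123933, 100001, 139335),
  LA.code = c(201, 202, 201),
  LA.name = c("City of London", "Camden", "City of London"),
  EstablishmentNumber = c(3614, 6005, 6006)
)

rows_picked <- school_data[c(1, 3), ]  # rows 1 and 3, every column
cols_picked <- school_data[, c(1, 3)]  # every row, columns URN and LA.name

# Merge on the linked key; only schools present in both tables are kept.
merge_data <- merge(res_data, cols_picked, by.x = "schoolID", by.y = "URN")
merge_data
```

Adding all.x = TRUE to the merge() call would keep schools from res_data that have no match, with NA for the LA name.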
