Identify change in categorical data across datapoints in R - r

I need help with data manipulation in R.
My dataset looks something like this.
Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Now I want to identify the number of changes that “Smith” has gone through (two) based on combination of Name and country only.
Basically, I want to compare each datapoint with other datapoints and identify the count of changes that have been there in the entire dataset for the combination of Name and Country only.

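You can do this by comparing each row's combination with its lagged value within each group, using dplyr:
library(dplyr)
df %>%
  group_by(Name) %>%
  mutate(combination = paste(country, age),
         lag_combination = lag(combination, default = ""),
         Changes = cumsum(combination != lag_combination)) %>%
  slice(n()) %>%
  select(Name, Changes)
Result:
# A tibble: 3 x 2
# Groups: Name [3]
Name Changes
<fctr> <int>
1 Avin 2
2 Robin 1
3 Smith 3
Notes:
Inside a grouped mutate(), dplyr::lag() is evaluated separately for each Name, so no order_by argument is needed to keep the comparison within a group.
I'm setting default = "" so that the first entry of each group is not NA. Recent dplyr versions require default to have the same type as the lagged vector, so a character default is used rather than 0. Because the empty string never equals a real combination, the first row of each group is counted as a change; subtract 1 from Changes if you only want transitions between consecutive rows.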
Data:
df = read.table(text="Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ',')
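If only transitions between consecutive rows should count, and only country matters (as the question asks), a base R version is also possible. The following is a minimal sketch rather than the answer's method: count_changes is a hypothetical helper, and the data frame repeats just the Name and country columns of the sample data above.

```r
# Sample data: Name and country only (age is ignored on purpose)
df <- data.frame(
  Name = c("Smith", "Avin", "Smith", "Robin", "Smith", "Robin", "Avin"),
  country = c("Canada", "India", "India", "France", "Canada", "France", "France"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of positions where a value differs from the previous one
count_changes <- function(x) sum(x[-1] != x[-length(x)])

# Apply the helper to each person's countries, keeping the original row order
changes <- tapply(df$country, df$Name, count_changes)
changes
```

With this definition the first row of each group is not counted, so a name whose country never changes gets 0.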

Related

how to find top highest number in R

I'm new to R. I want to display the city name and the total attendance of the five stadiums with the highest attendance. I have a data frame called worldcupmatches. Can anyone help me out?
Since you have not provided us a subset of your data (which is strongly recommended), I will create a tiny dataset with city names and attendance like so:
df = data.frame(city = c("London", "Liverpool", "Manchester", "Birmingham"),
attendance = c(2390, 1290, 8734, 5433))
Then your problem can easily be solved. For example, one of the base R approaches is:
df[order(df$attendance, decreasing = T), ]
You could also use dplyr which makes things look a little tidier:
library(dplyr)
df %>% arrange(desc(attendance))
The output of both methods is your original data, ordered from the highest to the lowest attendance:
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
2 Liverpool 1290
If you specifically want to display a certain number of cities (or stadiums) with the highest attendance, you could do:
df[order(df$attendance, decreasing = T), ][1:3, ] # 1:3 takes the top 3 stadiums
city attendance
3 Manchester 8734
4 Birmingham 5433
1 London 2390
Again, the dplyr version reads more cleanly:
df %>% slice_max(n = 3, order_by = attendance)
city attendance
1 Manchester 8734
2 Birmingham 5433
3 London 2390
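Since the question asks for the total attendance per city, the rows may need to be summed before ranking. Here is a minimal base R sketch under that assumption; the City and Attendance column names and all values are made up for illustration, because the real worldcupmatches data was not shown.

```r
# Hypothetical stand-in for worldcupmatches; the real column names may differ
worldcupmatches <- data.frame(
  City = c("London", "London", "Manchester", "Liverpool",
           "Birmingham", "Leeds", "Leeds", "Bristol"),
  Attendance = c(2390, 3100, 8734, 1290, 5433, 4000, 2500, 900),
  stringsAsFactors = FALSE
)

# Sum attendance over all matches played in each city
totals <- aggregate(Attendance ~ City, data = worldcupmatches, FUN = sum)

# Sort by total attendance, highest first, and keep the first five rows
top5 <- head(totals[order(totals$Attendance, decreasing = TRUE), ], 5)
top5
```

If each row already holds one stadium's total, the aggregate() step can be skipped and the ordering applied directly.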

Compare a variable by state abbreviations

How can I compare a variable by state abbreviations?
My data set has 5 variables currently. One of them is Location, and it is written as: "Raleigh, NC"
I need to create a variable that contains the two-character state abbreviation for each observation, and afterward another to group them by state. Each observation is of a college including their classification(private/public), instate/out of state tuition, and location.
This should work for you, if I understood your issue correctly.
Note: Please always share sample data using dput(your_dataset) or dput(head(your_dataset))
library(tidyverse)
d <- tibble(id = 1:3,
            Location = c("Newyork, NY", "Raleigh, NC", "Delhi, IN"))
# Split on the comma plus any following whitespace, so no trimming is needed
d %>% separate(Location, into = c("city", "state"), sep = ",\\s*")
# A tibble: 3 x 3
id city state
<int> <chr> <chr>
1 1 Newyork NY
2 2 Raleigh NC
3 3 Delhi IN
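The question also asks to group by state afterward. The following base R sketch shows one way to do both steps; the colleges data frame, the in_state_tuition column and its values are hypothetical, since no sample data was shared.

```r
# Hypothetical sample; only the "City, ST" format of Location comes from the question
colleges <- data.frame(
  Location = c("Raleigh, NC", "Durham, NC", "New York, NY"),
  in_state_tuition = c(9000, 11000, 15000),
  stringsAsFactors = FALSE
)

# Drop everything up to the last comma (and following spaces) to keep the state code
colleges$State <- sub("^.*,\\s*", "", colleges$Location)

# Compare a variable by state, here the mean in-state tuition
by_state <- aggregate(in_state_tuition ~ State, data = colleges, FUN = mean)
by_state
```

Because the pattern removes everything before the last comma, city names with spaces such as "New York" are handled as well.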

How do I extract part of the data from a column and paste it into another column using R?

I want to extract part of the data from a column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use separate() from the tidyr package, which is loaded with the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, since c() silently drops NULL elements.
I also set stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
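For comparison, the same result can be reached without any packages. This is a minimal base R sketch, assuming a phone number is always a "+" followed by digits at the end of the country string, separated by a single space; has_number and number are names introduced here for illustration.

```r
names <- c("Sia", "Ryan", "J", "Ricky")
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)

# Rows whose country string ends in " +digits"
has_number <- grepl(" \\+[0-9]+$", data$country)

# Pull the trailing number out of those rows; leave NA elsewhere
number <- ifelse(has_number, sub("^.* (\\+[0-9]+)$", "\\1", data$country), NA)

# Fill mobile only where it is missing, then strip the number from country
data$mobile <- ifelse(is.na(data$mobile), number, data$mobile)
data$country <- sub(" \\+[0-9]+$", "", data$country)
data
```

Existing mobile values are never overwritten, which mirrors the coalesce(mobile, number) step in the tidyverse version.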

Count word frequency across multiple columns in R

I have a data frame in R with multiple columns with multi-word text responses, that looks something like this:
1a       1b             1c      2a          2b             2c
student  job prospects  money   professors  students       campus
future   career         unsure  my grades   opportunities  university
success  reputation     my job  earnings    courses        unsure
I want to be able to count the frequency of words in columns 1a, 1b, and 1c combined, as well as 2a, 2b, and 2c combined.
Currently, I'm using this code to count word frequency in each column individually.
data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))
Ideally, I want to be able to combine the two sets of columns into just two columns and then use this same code to count word frequency, but I'm open to other options.
The combined columns would look something like this:
1              2
student        professors
future         my grades
success        earnings
job prospects  students
career         opportunities
reputation     courses
money          campus
unsure         university
my job         unsure
Here's a way using dplyr and tidyr packages. FYI, one should avoid having column names starting with a number. Naming them a1, a2... would make things easier in the long run.
df %>%
gather(variable, value) %>%
mutate(variable = substr(variable, 1, 1)) %>%
mutate(id = ave(variable, variable, FUN = seq_along)) %>%
spread(variable, value)
id 1 2
1 1 student professors
2 2 future my grades
3 3 success earnings
4 4 job prospects students
5 5 career opportunities
6 6 reputation courses
7 7 money campus
8 8 unsure university
9 9 my job unsure
Data -
df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects",
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students",
"opportunities", "courses"), `2c` = c("campus", "university",
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA,
-3L))
In general, you should avoid column names that start with numbers. That aside, I created a reproducible example of your problem and provided a solution using dplyr and tidyr. The substr() call inside mutate_at() assumes your column names follow the [num][char] pattern in your example.
library(dplyr)
library(tidyr)
data <- tibble::tribble(
~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
'student','job prospects', 'mone', 'professor', 'students', 'campus',
'future', 'career', 'unsure', 'my grades', 'opportunities', 'university',
'success', 'reputation', 'my job', 'earnings', 'courses', 'unsure'
)
data %>%
gather(key, value) %>%
mutate_at('key', substr, 0, 1) %>%
group_by(key) %>%
mutate(id = row_number()) %>%
spread(key, value) %>%
select(-id)
# A tibble: 9 x 2
`1` `2`
<chr> <chr>
1 student professor
2 future my grades
3 success earnings
4 job prospects students
5 career opportunities
6 reputation courses
7 mone campus
8 unsure university
9 my job unsure
If your end purpose is to count frequency (as opposed to switching from wide to long format), you could do
ave(unlist(df[,paste0("a",1:3)]), unlist(df[,paste0("a",1:3)]), FUN = length)
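which returns, for each element of columns a1, a2, a3, how many times that element occurs across those columns, where df denotes the data frame (and the columns are labeled a1,a2,a3,b1,b2,b3). Note that this counts whole responses, not individual words.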
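If the goal is word counts per column group, the reshaping step can be skipped entirely. The sketch below is one possible base R approach, assuming the column names start with the group number as in the question; count_words is a hypothetical helper, not part of any package.

```r
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: word frequencies over all columns whose name starts with prefix
count_words <- function(data, prefix) {
  cols <- data[startsWith(names(data), prefix)]
  responses <- unlist(cols, use.names = FALSE)
  words <- unlist(strsplit(tolower(responses), " ", fixed = TRUE))
  sort(table(words), decreasing = TRUE)
}

freq1 <- count_words(df, "1")  # columns 1a, 1b, 1c combined
freq2 <- count_words(df, "2")  # columns 2a, 2b, 2c combined
freq1
```

Wrapping either result in data.frame() gives the same two-column layout as the code in the question.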

Complex Data Frame calculations in R

I currently am importing two tables (in the most basic form) that appear as such
Table 1
State Month Account Value
NY Jan Expected Sales 1.04
NY Jan Expected Expenses 1.02
Table 2
State Month Account Value
NY Jan Sales 1,000
NY Jan Customers 500
NY Jan F Expenses 1,000
NY Jan V Expenses 100
And my end goal is to create a 3rd data frame that includes the values of the first two rows and calculates a 4th column based off of functions
NextYearExpenses = (t2 F Expenses + t2 V Expenses)* t1 Expected Expenses
NextYearSales = (t2 sales) * t1 Expected Sales
So my desired output is as follows
State Month New Account Value
NY Jan Sales 1,040
NY Jan Expenses 1,122
I am relatively new to R and I think ifelse statements might be my best bet. I have tried merging the tables and calculating with simple column functions but with no real progress.
Any suggestions?
You may need to do some data wrangling but nothing out of the ordinary
require(dplyr)
Table1<-tibble(State=c("NY","NY"), Month=c("Jan","Jan"), Account=c("Expected Sales", "Expected Expenses"), Value=c(1.04,1.02))
Table2<-tibble(State=c("NY","NY","NY","NY"), Month=c("Jan","Jan","Jan","Jan"), Account=c("Sales", "Customers", "F Expenses","V Expenses"), Value=c(1000,500,1000,100))
First, I rename the accounts so that they share a common name (Expenses); this makes it possible to merge with Table1 later on
Table2$Account[Table2$Account=="F Expenses"]<-"Expenses"
Table2$Account[Table2$Account=="V Expenses"]<-"Expenses"
Then I use group_by() to group by State, Month and Account, and sum the values
Table2 <- Table2 %>% group_by(State, Month,Account) %>%
summarise(Tot_Value=sum(Value)) %>% ungroup()
head(Table2)
## State Month Account Tot_Value
## <chr> <chr> <chr> <dbl>
## 1 NY Jan Customers 500
## 2 NY Jan Expenses 1100
## 3 NY Jan Sales 1000
then something similar with the renaming for the accounts in table 1
Table1$Account[Table1$Account=="Expected Sales"]<-"Sales"
Table1$Account[Table1$Account=="Expected Expenses"]<-"Expenses"
Merge into a third table, Table 3
Table3<- left_join(Table1,Table2)
use mutate to do the needed operation
Table3 <- Table3 %>% mutate(Value2=Value*Tot_Value)
head(Table3)
## # A tibble: 2 x 6
## State Month Account Value Tot_Value Value2
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 NY Jan Sales 1.04 1000 1040
## 2 NY Jan Expenses 1.02 1100 1122
Here's what I did with dplyr and tidyr.
First I combined your initial tables with rbind into a single long format table. Since you have unique identifiers for each of the Account values, these don't need to be separate tables. Next I group_by State and Month to group these assuming eventually you'll have a variety of states/months. Next I summarise based on the values of Account that you specified and created two new columns. Finally to get it into the long format that you want I used gather from tidyr to go from wide format to long format. You can separate these commands into smaller chunks by deleting after the %>% to get a better idea of what each step does.
library(dplyr)
library(tidyr)
rbind(df,df2) %>%
group_by(State,Month) %>%
summarise(Expenses = (Value[which(Account == "F Expenses")] + Value[which(Account == "V Expenses")]) * Value[which(Account == "Expected Expenses")],
Sales = Value[which(Account == "Sales")] * Value[which(Account == "Expected Sales")]) %>%
gather(New_Account,Value, c(Expenses,Sales))
# A tibble: 2 x 4
# Groups: State [1]
# State Month New_Account Value
# <chr> <chr> <chr> <dbl>
#1 NY Jan Expenses 1122
#2 NY Jan Sales 1040
I'd recommend checking out the concept of "tidy data", as there are some real challenges with working on data with the structure you currently have. E.g. creating t3 should only take 2-3 lines of code, all of this is just to work around your data architecture:
library(tidyverse)
t1 <- data.frame(State = rep("NY", 2),
Month = rep(as.Date("2018-01-01"), 2),
Account = c("Expected Sales", "Expected Expenses"),
Value = c(1.04, 1.02),
stringsAsFactors = FALSE)
t2 <- data.frame(State = rep("NY", 4),
Month = rep(as.Date("2018-01-01"), 4),
Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
Value = c(1000, 500, 1000, 100),
stringsAsFactors = FALSE)
t3 <- t2 %>%
spread(Account, Value) %>%
inner_join({
t1 %>%
spread(Account, Value)
}, by = c("State" = "State", "Month" = "Month")) %>%
mutate(NewExpenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`,
NewSales = Sales * `Expected Sales`) %>%
select(State, Month, Sales = NewSales, Expenses = NewExpenses) %>%
gather(Sales, Expenses, key = `New Account`, value = Value)
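The same calculation can also be sketched in base R with aggregate() and merge(). This is an illustration under the assumption that the account names follow the patterns in the question ("F "/"V " prefixes for expense types, "Expected " for the growth factors); the Value_factor and Value_total column names are introduced here and are not from the original tables.

```r
t1 <- data.frame(State = "NY", Month = "Jan",
                 Account = c("Expected Sales", "Expected Expenses"),
                 Value = c(1.04, 1.02), stringsAsFactors = FALSE)
t2 <- data.frame(State = "NY", Month = "Jan",
                 Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
                 Value = c(1000, 500, 1000, 100), stringsAsFactors = FALSE)

# Give matching rows a common account name in both tables
t2$Account <- sub("^[FV] ", "", t2$Account)
t1$Account <- sub("^Expected ", "", t1$Account)

# Total per account (adds fixed and variable expenses together)
totals <- aggregate(Value ~ State + Month + Account, data = t2, FUN = sum)

# Join growth factors to totals; unmatched accounts such as Customers drop out
t3 <- merge(t1, totals, by = c("State", "Month", "Account"),
            suffixes = c("_factor", "_total"))
t3$Value <- t3$Value_factor * t3$Value_total
t3 <- t3[, c("State", "Month", "Account", "Value")]
t3
```

Because the join is on State, Month and Account, the same code should carry over to more states and months without changes.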
