I have a data frame of names (df) as follows.
ID name
1 Xiaoao
2 Yukata
3 Kim
4 ...
The API returns output like this:
European-SouthSlavs,0.2244 Muslim-Pakistanis-Bangladesh,0.0000 European-Italian-Italy,0.0061 ...
I would like to use the API, which returns scores for up to 39 nationalities, to add new columns listing the top 3 scores per name. My desired outcome is as follows.
ID name score nat
1 Xiaoao 0.7361 Chinese
1 Xiaoao 0.1721 Korean
1 Xiaoao 0.0721 Japanese
2 Yukata 0.8121 Japanese
2 Yukata 0.0811 Chinese
2 Yukata 0.0122 Korean
3 Kim 0.6532 Korean
3 Kim 0.2182 Chinese
3 Kim 0.0981 Japanese
4 ... ... ...
Below is some of my scratch code. I failed to get the desired outcome because of a number of errors.
df_result <- purrr::map_dfr(df$name, function(name) {
  result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
                       "API TOKEN", "/", URLencode(df$name)))
  if (http_error(result)) {
    NULL
  } else {
    nat <- content(result, "text")
    nat <- do.call(rbind, strsplit(strsplit(nat, split = "(?<=\\d)\n", perl = T)[[1]], ","))
    # first three nationalities
    top_nat <- nat[order(as.numeric(nat[, 2]), decreasing = T)[1:3], ]
    c(df$name, as.vector(t(top_nat)))
  }
})
First, the results of top scores were based on the entire data rather than per name.
Second, I faced an error saying "Error in dplyr::bind_rows(): ! Argument 1 must have names."
If you can add any comments on my code, I will appreciate it!
Thank you in advance.
The output of each iteration of map_dfr should be a data frame, so that the rows can be bound together:
library(tidyverse)
library(httr)
df <- data.frame(name = c("Xiaoao", "Yukata", "Kim"))
map_dfr(df$name, function(name) {
  data.frame(name = name, score = sample(1:10, 1))
})
Instead of concatenating name with top_nat at the end of your function, you should be making it a data.frame!
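For example, here is a sketch of a corrected version. The parsing is pulled out into its own function so it can be tested without hitting the API; the response format (newline-separated "label,score" pairs) and the labels/scores in `resp` are assumptions based on the snippets in the question, and "API_TOKEN" is a placeholder:

```r
# Parse one API response (assumed newline-separated "label,score" pairs)
# and keep the top 3 scores for a given name.
parse_top3 <- function(txt, name) {
  nat <- do.call(rbind, strsplit(strsplit(txt, "\n")[[1]], ","))
  ord <- order(as.numeric(nat[, 2]), decreasing = TRUE)
  top <- nat[ord[seq_len(min(3, nrow(nat)))], , drop = FALSE]
  data.frame(name = name, score = as.numeric(top[, 2]), nat = top[, 1],
             stringsAsFactors = FALSE)
}

# Hypothetical response for one name (labels and scores are made up):
resp <- paste("EastAsian-Chinese,0.7361", "European-SouthSlavs,0.2244",
              "European-Italian-Italy,0.0061", "Muslim-Pakistanis-Bangladesh,0.0000",
              sep = "\n")
parse_top3(resp, "Xiaoao")  # a 3-row data frame, one row per top nationality

# In the real function, fetch per name (note `name`, not df$name) and
# return parse_top3(...) so map_dfr() can bind the rows:
# get_top3 <- function(name) {
#   result <- httr::GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
#                              "API_TOKEN", "/", utils::URLencode(name)))
#   if (httr::http_error(result)) return(NULL)
#   parse_top3(httr::content(result, "text"), name)
# }
# df_result <- purrr::map_dfr(df$name, get_top3)
```

Because each iteration now returns a data frame, map_dfr() has named columns to bind on and the "Argument 1 must have names" error goes away.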
I am an intermediate user of R with a data set of ~850,000 rows that was edited in Stata and saved as a csv; about 0.01% of the rows got split, with everything after column 11 spilling onto the following row. I am trying to get the file back to its original form, with no split rows. I was using column 4's "typeof" as the required condition, but someone below pointed out this won't work: I tested it, and all object types in the data frame are indeed "integer". Maybe this would work if I changed what I check for in column 4, but here is what I tried:
wages <- for (i in wages) {
  if (typeof(wages[i, 4]) == "integer") {
    cat(i - 1, i)
  }
}
All I get is NAs.
When trying:
for (i in wages) {
  if (typeof(i[, 4]) == "integer") {
    append(i - 1, i, after = length(i - 1))
  }
}
it says:
Error in `[.default`(i, , 4) : incorrect number of dimensions
I have spent hours searching for solutions and trying different methods with no success. Thanks in advance for any help.
Snippet of data:
WD County_Name State_Name Cons_Code constructiondescription wagegroup Rate_Effective_Date hourly
113352 CO20190006 Adams Colorado Highway SUCO2011-001 9/15/2011 22.67
113353 CO20190004 Adams Colorado Residential PLUM0058-011 7/1/2018 32.75
113354 (pipefitters exclude hvac pipe) SOUTHWEST CO 8001 METRO 1352 100335 plumber
113355 CO20190004 Adams Colorado Residential PLUM0145-005 8/1/2016 24.58
fringe Rate_Type Craft_Title region st_abbr stcnty_fips mr supergrp
113352 8.73 Open power equipment operator: broom/sweeper arapahoe SOUTHWEST CO 8001 METRO 1352
113353 14.85 CBA plumber/pipefitter (plumbers include hvac pipe) NA NA
113354 1 NA NA
113355 10.47 CBA plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe) SOUTHWEST CO 8001 METRO 1352
group key_craft key
113352 100335 operator 1
113353 NA NA
113354 NA NA
113355 100335 plumber 1
Reproducible data:
data <- data.frame(c("CO20190006", "CO20190004", "(pipefitters exclude hvac pipe)", "CO20190004"), #1
                   c("Adams", "Adams", "SOUTHWEST", "Adams"),                 #2
                   c("Colorado", "Colorado", "CO", "Colorado"),               #3
                   c("Highway", "Residential", "8001", "Residential"),        #4
                   c("", "", "METRO", ""),                                    #5
                   c("SUCO2011-001", "PLUM0058-011", "1352", "PLUM0145-005"), #6
                   c("9/15/2011", "7/1/2018", "100335", "8/1/2016"),          #7
                   c("22.67", "32.75", "plumber", "24.58"),                   #8
                   c("8.73", "14.85", "1", "10.47"),                          #9
                   c("Open", "CBA", "", "CBA"),                               #10
                   c("power equipment operator: broom/sweeper arapahoe",
                     "plumber/pipefitter (plumbers include hvac pipe)",
                     "",
                     "plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe)"), #11
                   c("SOUTHWEST", "", "", "SOUTHWEST"),                       #12
                   c("CO", "", "", "CO"),                                     #13
                   c("8001", NA, NA, "8001"),                                 #14
                   c("METRO", "", "", "METRO"),                               #15
                   c("1352", NA, NA, "1352"),                                 #16
                   c("100335", NA, NA, "100335"),                             #17
                   c("operator", "", "", "plumber"),                          #18
                   c("1", NA, NA, "1"))                                       #19
colnames(data) <- c("WD", "County_Name", "State_Name", "Cons_Code", "constructiondescription",
                    "wagegroup", "Rate_Effective_Date", "hourly", "fringe", "Rate_Type",
                    "Craft_Title", "region", "st_abbr", "stcnty_fips", "mr", "supergrp",
                    "group", "key_craft", "key")
The following solution should do the job:
new_data <- NULL
i <- 1
while (i <= nrow(data)) {
  new_data <- rbind(new_data, data[i, ])
  if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
    if (!paste(data[i, 11], data[i + 1, 1]) %in% levels(new_data[, 11])) {
      levels(new_data[, 11]) <- c(
        levels(new_data[, 11]), paste(data[i, 11], data[i + 1, 1])
      )
    }
    new_data[nrow(new_data), 11] <- paste(data[i, 11], data[i + 1, 1])
    new_data[nrow(new_data), ] <- cbind(new_data[nrow(new_data), 1:11], data[i + 1, 2:9])
    i <- i + 2
  } else {
    i <- i + 1
  }
}
Note that, as it stands, your data frame stores strings as factors: in R versions before 4.0, data.frame() defaults to stringsAsFactors = TRUE. You can read more about factors and their levels in data frames here.
Therefore, in the code above, we first add a new row to the clean new_data:
new_data <- rbind(new_data, data[i, ])
Then we test whether that row is split by checking if there are NAs in columns 14, 16, 17, and 19:
if (all(is.na(data[i, c(14, 16, 17, 19)])))
If so, in order for us to be able to merge the cell in column 11 of the split row with the 1st cell of the following row, we first need to check whether that level already exists in that column and if not:
if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11]))
it needs to be added to the list of levels before merging:
levels(new_data[,11]) <- c(levels(new_data[,11]), paste(data[i,11], data[i+1,1]))
And then, finally, the merging (to complete the cell in column 11 of the split row) can be done:
new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
After that, the remaining missing columns are added to the split row in question:
new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
Version LITE
Now, I suspect that all this checking for and adding of factor levels takes some extra time, so here is a lighter version of the code that converts the implicated column 11 to plain character instead of factor. That makes sense in this particular data set, as that column does not seem to be intended as a factor anyway. This way, all the factor checking / adding can be skipped:
data[, 11] <- as.character(data[, 11])
new_data <- NULL
i <- 1
while (i <= nrow(data)) {
  new_data <- rbind(new_data, data[i, ])
  if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
    new_data[nrow(new_data), 11] <- paste(data[i, 11], data[i + 1, 1])
    new_data[nrow(new_data), ] <- cbind(new_data[nrow(new_data), 1:11], data[i + 1, 2:9])
    i <- i + 2
  } else {
    i <- i + 1
  }
}
Let me know if this improved the speed!
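To illustrate the merge step, here is the same logic run on a minimal toy frame (a sketch, not the real data): a single `key` column stands in for columns 14/16/17/19, and `desc` stands in for column 11.

```r
toy <- data.frame(
  id   = c("A1", "A2",      "(rest of text)", "A3"),
  desc = c("d1", "plumber", "k2",             "d3"),
  key  = c("1",  NA,        NA,               "3"),
  stringsAsFactors = FALSE
)

new_data <- NULL
i <- 1
while (i <= nrow(toy)) {
  new_data <- rbind(new_data, toy[i, ])
  if (is.na(toy$key[i])) {                 # split row detected
    # merge the truncated text cell with column 1 of the spilled row
    new_data$desc[nrow(new_data)] <- paste(toy$desc[i], toy$id[i + 1])
    # copy the remaining field(s) over from the spilled row
    new_data$key[nrow(new_data)] <- toy$desc[i + 1]
    i <- i + 2                             # skip the spilled row
  } else {
    i <- i + 1
  }
}
new_data  # 3 rows: the split row has been re-assembled
```

Row 2 ends up as `A2 | plumber (rest of text) | k2`, which is exactly the repair the while loop above performs on the real 19-column data.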
I'm looking for an easy fix to read a txt file that looks like this when opened in Excel:
IDmaster By_uspto App_date Grant_date Applicant Cited
2 1 19671106 19700707 Motorola Inc 1052446
2 1 19740909 19751028 Gen Motors Corp 1062884
2 1 19800331 19820817 Amp Incorporated 1082369
2 1 19910515 19940719 Dell Usa L.P. 389546
2 1 19940210 19950912 Schueman Transfer Inc. 1164239
2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
EDIT: Opening the txt file in Notepad shows the following (with commas). The last two rows exhibit the problem.
IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336
The problem is that some of the Applicant names contain commas so that they are read as if they belong in a different column, which they actually don't.
Is there a simple way to
a) "teach" R to keep string variables together, regardless of commas in between, or
b) read in the first 4 columns, and then add an extra column for everything after the last comma?
Given the length of the data I can't open it entirely in Excel, which would otherwise be a simple alternative.
If your example is written in a "Test.csv" file, try with:
read.csv(text = gsub(", ", " ", paste0(readLines("Test.csv"), collapse = "\n")),
         quote = "'",
         stringsAsFactors = FALSE)
It returns:
# IDmaster By_uspto App_date Grant_date Applicant Cited
# 1 2 1 19671106 19700707 Motorola Inc 1052446
# 2 2 1 19740909 19751028 Gen Motors Corp 1062884
# 3 2 1 19800331 19820817 Amp Incorporated 1082369
# 4 2 1 19910515 19940719 Dell Usa L.P. 389546
# 5 2 1 19940210 19950912 Schueman Transfer Inc. 1164239
# 6 2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
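Alternatively, option (b) from the question can be done with a regular expression: the first four fields and the last field are comma-free, and a greedy group absorbs everything in between, commas included. A sketch using strcapture() from base R (available since R 3.4), with the two sample rows inlined instead of read from a file:

```r
lines <- c("2,1,19671106,19700707,Motorola Inc,1052446",
           "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239")

# four comma-free fields, then a greedy Applicant group, then the last field
pat <- "^([^,]*),([^,]*),([^,]*),([^,]*),(.*),([^,]*)$"
out <- strcapture(pat, lines,
                  proto = data.frame(IDmaster = integer(), By_uspto = integer(),
                                     App_date = integer(), Grant_date = integer(),
                                     Applicant = character(), Cited = integer()))
out$Applicant  # "Motorola Inc"  "Schueman Transfer, Inc."
```

Unlike the gsub() approach above, this keeps the embedded commas in the Applicant names instead of stripping them.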
This provides a very silly workaround, but it does the trick for me (because I don't really care about the Applicant names at the moment). However, I'm hoping for a better solution.
Step 1: Open the .txt file in Notepad and add five column names, V1 through V5 (to be sure to capture names with multiple commas).
bc <- read.table("data.txt", header = T, na.strings = T, fill = T, sep = ",", stringsAsFactors = F)
library(data.table)
sapply(bc, class)
unique(bc$V5) # only NA so can be deleted
setDT(bc)
bc <- bc[,1:10, with = F]
# coerce the spill-over columns to numeric and replace NAs with 0
for (col in c("Cited", "V1", "V2", "V3", "V4")) {
  bc[[col]] <- as.numeric(bc[[col]])
  bc[[col]][is.na(bc[[col]])] <- 0
}
head(bc, 10)
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)
It's a silly patch, but it does the trick in this particular context.
I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function, but I'm quickly getting in over my head. Any suggestions? Here is a sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a@a.com", "b@b.com", "c@c.com", "d@d.com", "e@e.com", "f@f.com", "g@g.com", "h@h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b@b.com 1
2 La Forge Geordi d@d.com 2
3 Crusher Beverly f@f.com 3
4 Data Data h@h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a@a.com 1
2 Riker William c@c.com 2
3 Crusher Beverly f@f.com 3
4 Data Data h@h.com 4
%>% means 'and then'
The code is read as:
Take df AND THEN, for each 'Section', select one row by position (slice). Voilà.
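Note that sample(c(1, 2), 1) hard-codes two rows per section; sampling a position from 1:n() generalizes to any group size. A sketch using the question's data (Email column omitted for brevity):

```r
library(dplyr)

df <- data.frame(
  Last.Name  = c("Picard", "Troi", "Riker", "La Forge", "Yar", "Crusher", "Crusher", "Data"),
  First.Name = c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data"),
  Section    = c(1, 1, 2, 2, 3, 3, 4, 4)
)

random_4 <- df %>%
  group_by(Section) %>%
  slice(sample(n(), 1)) %>%  # n() is the size of the current group
  ungroup()
random_4  # one randomly chosen row per section
```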
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, thank you. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been approaching the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by that number:
First, I broke up the data frame into sections:
df1 <- subset(df, Section == 1)
df2 <- subset(df, Section == 2)
df3 <- subset(df, Section == 3)
df4 <- subset(df, Section == 4)
Then I randomly generated a group number, 1 through 4:
Groupnumber <- sample(1:4, 4, replace = F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
*I ran the group number generator and cbind in alternating order until I got through the whole set. (I wanted to make sure the order of the numbers was unique for each section.)
Finally, I row-bound the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
                     condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here's with base R subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b@b.com 1 group_1
3 Riker William c@c.com 2 group_1
6 Crusher Beverly f@f.com 3 group_1
7 Crusher Wesley g@g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a@a.com 1 group_2
4 La Forge Geordi d@d.com 2 group_2
5 Yar Tasha e@e.com 3 group_2
8 Data Data h@h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complementary set
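If you want to go one step further in base R and assign every student at once (two groups of four, one student per section in each), you can shuffle positions within each section with ave() and use the shuffled position as the group number. A sketch (Email column omitted for brevity):

```r
df <- data.frame(
  Last.Name  = c("Picard", "Troi", "Riker", "La Forge", "Yar", "Crusher", "Crusher", "Data"),
  First.Name = c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data"),
  Section    = c(1, 1, 2, 2, 3, 3, 4, 4)
)

set.seed(42)  # reproducible shuffle
# shuffle the positions 1..k within each Section and use them as group
# numbers; with two students per section this yields two groups of four
df$group <- ave(seq_len(nrow(df)), df$Section,
                FUN = function(i) sample(seq_along(i)))
split(df, df$group)  # two groups of four, one student per section in each
```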
Let's say I have:
Person Movie Rating
Sally Titanic 4
Bill Titanic 4
Rob Titanic 4
Sue Cars 8
Alex Cars **9**
Bob Cars 8
As you can see, there is a contradiction for Alex. All rows for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check of whether all rows for a movie match the rating of its first occurrence? For all that return "no," I can go look at them manually. How would I write this function?
Thanks
Here's a data.table solution.
Define the function:
Myfunc <- function(x) {
  temp <- table(x)
  names(temp)[which.max(temp)]
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or, if you want to remove the "bad" ratings:
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer, which finds the most common item in a vector:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)
which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
dat[.Last.value,]
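One caveat: .Last.value is only updated at the interactive prompt, so in a script it is safer to store the intermediate results. A self-contained version of the same idea, using the reproducible data from the previous answer:

```r
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))

# variance per movie: anything above (numerical) zero means a disagreement
v <- tapply(dat$Rating, dat$Movie, FUN = var)
names(v)[v > 1e-5]            # movies with a contradiction: "Cars"

# the single row that deviates most from its movie's mean rating
dev <- abs(dat$Rating - ave(dat$Rating, dat$Movie))
dat[which.max(dev), ]         # Alex's row
```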
I would add a variable for the mode so I can see if there is anything weird going on with the data, like missing data, text, or many different answers instead of one rare anomaly, etc. I used "x" as your data set.
# one of many functions to find mode, could use any other
modefunc <- function(x) {
  names(table(x))[table(x) == max(table(x))]
}
# add variable for mode split by Movie
x$mode <- ave(x = x$Rating, x$Movie, FUN = modefunc)
# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you prefer a different way to compute the mode, any other mode function will work here.
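For example, a quick run on the reproducible data from earlier in the thread. An as.numeric() wrapper is added so the mode column stays numeric like Rating; this sketch assumes each movie has a single, unique mode (modefunc returns more than one value on ties):

```r
modefunc <- function(x) {
  names(table(x))[table(x) == max(table(x))]
}

x <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                Movie = rep(c("Titanic", "Cars"), each = 3),
                Rating = c(4, 4, 4, 8, 9, 8))

# per-movie mode, kept numeric so it compares cleanly against Rating
x$mode <- ave(x$Rating, x$Movie, FUN = function(r) as.numeric(modefunc(r)))
x[x$Rating != x$mode, ]   # Alex's row
```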