R data frame with only matching rows

Hello, I'm trying to create a data frame inside a loop; each data frame should contain only the matching rows. The logic I'm trying to implement looks like this:
names <- unique(data$customer)
for (i in seq_along(names)) {
  city <- data$city[data$customer == names[i]]
  customer <- data$customer[data$customer == names[i]]
  df <- data.frame(customer, city)
}
Basically, I'm trying to make a data frame for each unique name in the list. I don't know how to compare values in a data frame; I've tried an if statement, but I couldn't get it to work.
An example input data frame would be something like this:

customer city
Joseph   WS
Edward   WS
Joseph   NY

so the first output data frame would be:

customer city
Joseph   WS
Joseph   NY

and the second output data frame would be:

customer city
Edward   WS

In conclusion, I'm trying to get a single data frame for every unique name in the list, where each data frame holds all the rows that include that name.

You can use split:
split(data, data$customer)

$Edward
  customer city
2   Edward   WS

$Joseph
  customer city
1   Joseph   WS
3   Joseph   NY

You can either refer to the data frames in this list, or even
list2env(split(data, data$customer), envir = .GlobalEnv)
and then just call each data frame by the customer's name.
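For reference, a minimal runnable version of this answer, using the example data from the question (column spelled customer here):

```r
# Example data from the question
data <- data.frame(customer = c("Joseph", "Edward", "Joseph"),
                   city     = c("WS", "WS", "NY"))

# split() returns a named list with one data frame per unique customer
dfs <- split(data, data$customer)

dfs$Joseph  # rows 1 and 3
dfs$Edward  # row 2
```

Note that split() keeps the original row order within each group, so dfs$Joseph contains WS before NY.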


Replace strings in a dataframe based on another dataframe

I have a 200k-row dataframe with a character column named "departament_name"; some of the values in this column contain a specific character, "?". For example: "GENERAL SAN MART?N", "UNI?N", etc.
I want to replace those values using another, 750k-row dataframe that contains a column also named "departament_name", but whose values are correct. Following the example, that would be: "GENERAL SAN MARTIN", "UNION", and so on.
Can I do this automatically using pattern recognition, without building a dictionary (there are several values with this problem)?
My objective is to have a unified dataset from the two dataframes, with unique values for those problematic rows in "departament_name". I prefer tidyverse (mutate, stringr, etc.) if possible.
You can try the stringdist_*_join functions from the fuzzyjoin package.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'departament_name')
#   departament_name.x departament_name.y
# 1 GENERAL SAN MART?N GENERAL SAN MARTIN
# 2              UNI?N              UNION
Obviously, this works for the simple example you have shared but it might not give you 100% correct result for all the entries in your actual data. You can tweak the parameters max_dist and method as per your data. See ?fuzzyjoin::stringdist_join for more information about them.
data
df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))
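If you'd rather avoid an extra dependency, a similar nearest-match lookup can be sketched with base R's adist. This is an alternative to the fuzzyjoin answer above, not the original poster's code, and the fixed_name column is an invented name for illustration:

```r
# Toy data, as in the answer above
df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))

# adist() gives the edit distance between every pair of names;
# for each bad name, pick the closest correct name
d <- adist(df1$departament_name, df2$departament_name)
df1$fixed_name <- df2$departament_name[apply(d, 1, which.min)]

df1$fixed_name
# [1] "GENERAL SAN MARTIN" "UNION"
```

On real data, compute this over unique(df1$departament_name) only and join the result back; a full 200k x 750k distance matrix will not fit in memory.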

Matching strings across columns of two dataframes in R

My apologies, in my haste to post my question I forgot to follow the basic rules of posting. I have edited my post in line with these rules:
R experts,
I appreciate that a similar question has been raised before but I am unable to adapt the solutions suggested before to my specific data problem. I basically have a dataframe (call it df1), where one of the columns is a string of sentences, part of which contains a city name and a country name. As an example, dataframe df1 has a column called bus_desc with the following data:
bus_desc
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares .....
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to....
In another dataframe (call it df2), I have two columns of data (named city and country) where each row contains a city name and the corresponding country as follows:
city     country
MOBILE   US
DELHI    INDIA
LONDON   UK
I want R to search the sentence string in each row of that column in df1 and match the city (and likewise the country) against the city (and country) column in dataframe df2. If there is a match, I want to create a city column in df1 and copy the matched city name over from df2 (and the same for country). My final output should look like this:
bus_desc: Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares .....
city: MOBILE
country: US

bus_desc: Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to....
city: DELHI
country: INDIA
Can anyone please suggest a straightforward solution for this, if one exists? I tried the code below, but it does not work:
df1 <- df1 %>%
  rowwise() %>%
  mutate(city = ifelse(grepl(toupper(bus_desc), df2$city), df2$city, df1))
Many thanks for your solutions and help on this.
Regards,
Dev
You can use str_extract to extract the city and country values listed in df2.
library(dplyr)
library(stringr)
df1 %>%
  mutate(city = str_extract(toupper(bus_desc), str_c(df2$city, collapse = '|')),
         country = str_extract(toupper(bus_desc), str_c(df2$country, collapse = '|')))
bus_desc
#1 Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares
#2 Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to
# city country
#1 MOBILE US
#2 DELHI INDIA
data
It is easier to help if you provide data in a reproducible format -
df1 <- data.frame(bus_desc = c('Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares',
                               'Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to'))
df2 <- data.frame(city = c('MOBILE', 'DELHI', 'LONDON'),
                  country = c('US', 'INDIA', 'UK'))
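One caveat worth noting (my addition, not part of the original answer): without word boundaries, an alternation like US|INDIA|UK can match inside a longer word, e.g. INDIA inside INDIANAPOLIS. Anchoring each alternative with \b avoids that. A base-R sketch, using an invented sentence to show the difference:

```r
countries <- c('US', 'INDIA', 'UK')
x <- toupper("Company QRS is based in Indianapolis, US.")

plain   <- paste(countries, collapse = '|')
bounded <- paste0('\\b(', paste(countries, collapse = '|'), ')\\b')

regmatches(x, regexpr(plain, x, perl = TRUE))    # "INDIA" -- wrong: found inside INDIANAPOLIS
regmatches(x, regexpr(bounded, x, perl = TRUE))  # "US"    -- the standalone country name
```

The same bounded pattern can be passed to str_extract in the answer above.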

Creating horizontal dataframe from vertical table (with repeated variables)

Need to create usable dataframe using R or Excel
Variable1           ID      Variable2
Name of A person 1  002157  NULL
Drugs used          NULL    3.0
Days in hospital    NULL    2
Name of a surgeon   NULL    JOHN T.
Name of A person 2  002158  NULL
Drugs used          NULL    4.0
Days in hospital    NULL    5
Name of a surgeon   NULL    ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. The task is to analyze how many drugs were used and how many days each patient stayed in the hospital.
For that reason, I need to transform this dataframe into a second dataframe suitable for analysis (from vertical to horizontal). Basically, I have to create a dataframe consisting of 4 columns: ID, Drugs used, Hospital stay, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first dataframe and extract the filled rows
for Name of a surgeon, Drugs used and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second dataframe.
In short, I have no idea how to do that. Could you guys help me write the functions for R, or give tips for Excel?
For R, I guess you want something like this:
Load the table; make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab, etc.):
df = read.table("path/to/file", sep=",")
Create subset tables that contain only one row per patient:
id = subset(df, !is.na(ID))
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
Make a new data frame that contains this information:
new_df = data.frame(
  id = id$ID,
  drugs = drugs$Variable2,
  days = days$Variable2
  #...etc... no comma after the last!
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
EDIT 2:
If you have an imperfect table, you might want to do something like this:
Step 1.5) Change all NA values (which in your table are labeled NULL, but I assume R will change that to NA) in the ID column to the patient's ID by filling it down. Note that the is.na() function in the code below is specifically for that, and will not work with NULL or "NULL" or other representations:
for (i in seq_along(df$ID)) {
  if (is.na(df$ID[i])) df$ID[i] <- df$ID[i - 1]
}
Then go to step 2) above again (you don't need the id subset, though), and then change each data frame a little. As an example, for the drugs and days data frames:
drugs = drugs[, -1]                # removes the first column
colnames(drugs) = c("ID", "drugs") # renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then, instead of doing step 3) as above, use merge and choose the ID column to be the merging column:
new_df = merge(drugs, days, by = "ID")
Repeat this for the other subsetted data frames:
new_df = merge(new_df, surgeon, by = "ID")
# etc...
That is much more robust: even if some patients have a line that others don't (e.g. days), their respective column in this new data frame will just contain an NA for that patient.
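Putting the whole approach together on the example data (a base-R sketch; the values are the ones shown in the question, rebuilt by hand):

```r
# Rebuild the example table from the question (NULLs read in as NA)
df <- data.frame(
  Variable1 = c("Name of A person 1", "Drugs used", "Days in hospital", "Name of a surgeon",
                "Name of A person 2", "Drugs used", "Days in hospital", "Name of a surgeon"),
  ID        = c("002157", NA, NA, NA, "002158", NA, NA, NA),
  Variable2 = c(NA, "3.0", "2", "JOHN T.", NA, "4.0", "5", "ADAM S.")
)

# Fill the patient ID down over the NA rows
for (i in seq_along(df$ID)) {
  if (is.na(df$ID[i])) df$ID[i] <- df$ID[i - 1]
}

# One subset per variable, then merge them on ID
drugs   <- setNames(subset(df, Variable1 == "Drugs used",        c(ID, Variable2)), c("ID", "drugs"))
days    <- setNames(subset(df, Variable1 == "Days in hospital",  c(ID, Variable2)), c("ID", "days"))
surgeon <- setNames(subset(df, Variable1 == "Name of a surgeon", c(ID, Variable2)), c("ID", "surgeon"))

new_df <- merge(merge(drugs, days, by = "ID"), surgeon, by = "ID")
new_df
#       ID drugs days surgeon
# 1 002157   3.0    2 JOHN T.
# 2 002158   4.0    5 ADAM S.
```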

Issues with copying row data and Paste -> R

I have an ASCII file that contains one week of data. The data is a text file and does not have header names. I have nearly completed a smaller task using R, and have made some attempts with Python as well. Being a pro at neither, it's been a steep learning curve. Below are my data and the R code I created to paste rows together based on a specific sequence of characters, which is not working.
Each column holds different data, but the row data is what matters most. for example:
      column 1     column 2       column 3        column 4
Row 1 Name         Age            YR              Birth Date
Row 2 Middle Name  School name    siblings        # of siblings
Row 3 Last Name    street number  street address
Row 4 Name         Age            YR              Birth Date
Row 5 Middle Name  School name    siblings        # of siblings
Row 6 Last Name    street number  street address
Row 7 Name         Age            YR              Birth Date
Row 8 Middle Name  School name    siblings        # of siblings
Row 9 Last Name    street number  street address
I have a folder to iterate or loop over; some files hold hundreds of rows, others thousands. I have code written that drops all the rows I don't need and writes to a new .csv; however, no pasting and/or merging has produced the desired result.
What I need is code to select only the Name and Last Name rows (and their adjacent data) from the entire file and paste each Last Name row at the end of its Name row. Each file has the same number of columns but a different number of rows.
I have read the file into a data frame and have tried merging/pasting/binding (r and c) the rows/columns, and the result is still just shy of what I need. rbind works best thus far, but instead of producing the data with the rows pasted one after another on the same line, they are pasted beside each other in columns, like this:

Name   Last Name       Name   Last Name       Name   Last Name
Age    Street Num      Age    Street Num      Age    Street Num
YR     Street address  YR     Street address  YR     Street address
Birth  NA              Birth  NA              Birth  NA
Date   NA              Date   NA              Date   NA
I have tried to rbind them, or family[c(Name, Age, YR, Birth, ...)], and I am not successful. I have looked at how many columns I have and tried to add more columns to account for the paste, and instead it populates with the data from row 1.
I'm really at a loss here, and if anyone can provide some insight I'd really appreciate it. I'm newer than some, but not as new as others. The results I am trying to achieve look like:

Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
codes tried:
rowData <- rbind(name$Name, name$Age, name$YRBirth, name$Date)
colData <- cbind(name$V1 == "Name", name$V1 == "Last Name")
merge and paste also do not work. I have tried to create each variable as new data frames and am still not achieving the results I am looking for. Does anyone have any insight?
OK, so if I understand your situation correctly, you want to first slice your data, pulling out every third row starting with the 1st row, and then every third row starting with the 3rd row. I'd do it like this (assume your data is in df):
df1 <- df[3*(1:(nrow(df)/3)) - 2,]
df2 <- df[3*(1:(nrow(df)/3)),]
Once you have these, you can just slap them together, but instead of using rbind you want to use cbind. Then you can drop the NA columns and rename them:
df3 <- cbind(df1,df2)
df3 <- df3[1:7]
colnames(df3) <- c("Name", "Age", "YR", "Birth date", "Last Name", "Street Num", "Street Address")
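A quick check of that recipe on a three-person toy version of the data (names and values invented for illustration, and trimmed to two columns per row so it stays short):

```r
# Toy data: rows repeat in groups of 3 (Name row, Middle-name row, Last-name row)
df <- data.frame(V1 = c("Ann", "A.", "Smith",  "Bob", "B.", "Jones",  "Cat", "C.", "Lee"),
                 V2 = c(30, "Central", 12,  41, "North", 9,  25, "South", 77))

df1 <- df[3 * (1:(nrow(df) / 3)) - 2, ]  # rows 1, 4, 7: the "Name" rows
df2 <- df[3 * (1:(nrow(df) / 3)), ]      # rows 3, 6, 9: the "Last Name" rows

df3 <- cbind(df1, df2)  # paste each Last-Name row beside its Name row
colnames(df3) <- c("Name", "Age", "Last Name", "Street Num")
df3
```

Each person now occupies a single row, which is the shape the question asks for.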

Changing Column Names using a Key

I have a data frame whose column headings are 3-letter keys, which looks like:
MFB MBB WBB
X X X
and another data frame with the full names:
Key Descr
MFB Men's Football
MBB Men's Basketball
WBB Women's Basketball
My question is, how would I go about renaming the columns so the original table looks like:
Men's Football Men's Basketball Women's Basketball
X X X
There are about 80 column headings I want to rename, so manually renaming each column is not desired. My guess is it could be done using a for loop or the 'map2' function from the 'purrr' library, but I am not sure where to start.
Similar to Rename multiple columns given character vectors of column names and replacement
To make your question fully reproducible, let's start with
library(tidyverse)
sports <- tibble(MFB = c("bears", "texans", "packers"),
                 MBB = c("bulls", "heat", "bucks"),
                 WBB = c("dream", "sky", "sun"))
pairs <- tibble(Key = c("MFB", "MBB", "WBB"),
                Descr = c("Men's Football", "Men's Basketball", "Women's Basketball"))
If the keys are already in the same order as the original column headings, then a simple
setNames(sports,pairs$Descr)
works. Otherwise
sports %>% rename_at(pairs$Key, ~pairs$Descr)
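A dependency-free alternative (my addition, not part of the answer above): a base-R lookup with match handles any key ordering, and leaves alone any heading that has no entry in pairs.

```r
sports <- data.frame(MFB = c("bears", "texans", "packers"),
                     MBB = c("bulls", "heat", "bucks"),
                     WBB = c("dream", "sky", "sun"))
pairs <- data.frame(Key   = c("WBB", "MFB", "MBB"),  # deliberately out of order
                    Descr = c("Women's Basketball", "Men's Football", "Men's Basketball"))

# Look up each current heading in the key column and take its description
hit <- match(colnames(sports), pairs$Key)
colnames(sports)[!is.na(hit)] <- pairs$Descr[hit[!is.na(hit)]]

colnames(sports)
# [1] "Men's Football" "Men's Basketball" "Women's Basketball"
```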
