I have an ASCII file that contains one week of data. It is a plain text file with no header names. I have nearly completed a smaller task using R, and have made some attempts with Python as well. Being a pro at neither, it's been a steep learning curve. Below are my data and the R code I wrote to paste rows together based on a specific sequence of characters; it is not working.
Each column holds different data, but the row data is what matters most. For example:
       column 1     column 2       column 3        column 4
Row 1  Name         Age            YR              Birth Date
Row 2  Middle Name  School name    siblings        # of siblings
Row 3  Last Name    street number  street address
Row 4  Name         Age            YR              Birth Date
Row 5  Middle Name  School name    siblings        # of siblings
Row 6  Last Name    street number  street address
Row 7  Name         Age            YR              Birth Date
Row 8  Middle Name  School name    siblings        # of siblings
Row 9  Last Name    street number  street address
I have a folder of files to loop over; some files hold hundreds of rows, others hold thousands. I have code that drops all the rows I don't need and writes to a new .csv; however, any pasting and/or merging isn't producing the desired results.
What I need is code that selects only the Name and Last Name rows (and their adjacent data) from the entire file and pastes the Last Name row onto the end of the Name row. Each file has the same number of columns but a different number of rows.
I have read the file into a data frame and have tried merging/pasting/binding (rbind and cbind) the rows/columns, and the result is still just shy of what I need. rbind works best so far, but instead of pasting the rows one after another on the same line, it puts them beside each other in columns like this:
Name   Last Name       Name   Last Name       Name   Last Name
Age    Street Num      Age    Street Num      Age    Street Num
YR     Street address  YR     Street address  YR     Street address
Birth  NA              Birth  NA              Birth  NA
Date   NA              Date   NA              Date   NA
I have tried rbind and things like family[c(Name, Age, YR, Birth, ...)] without success. I have looked at how many columns I have and tried adding more columns to account for the paste, but instead they populate with the data from row 1.
I'm really at a loss here, and if anyone can provide some insight I'd really appreciate it. I'm newer than some, but not as new as others. The result I am trying to achieve looks like:
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Code tried:
rowData <- rbind(name$Name, name$Age, name$YRBirth, name$Date)
colData <- cbind(name$V1 == "Name", name$V1 == "Last Name")
merge() and paste() also do not work. I have tried creating each variable as a new data frame and am still not achieving the results I am looking for. Does anyone have any insight?
OK, so if I understand your situation correctly, you want to first slice your data, pulling out every third row starting with the 1st row, and then pull out every third row starting with the 3rd row. I'd do it like this (assuming your data is in df):
df1 <- df[3*(1:(nrow(df)/3)) - 2,]  # rows 1, 4, 7, ... (the Name rows)
df2 <- df[3*(1:(nrow(df)/3)),]      # rows 3, 6, 9, ... (the Last Name rows)
Once you have these, you can just slap them together, but instead of using rbind you want to use cbind. Then you can drop the NA column and rename the rest.
df3 <- cbind(df1, df2)  # 8 columns: 4 from df1, 4 from df2
df3 <- df3[1:7]         # keep the first 7 columns; the 8th is all NA
colnames(df3) <- c("Name", "Age", "YR", "Birth date", "Last Name", "Street Num", "Street Address")
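As a side note, the same slicing can be written with seq(), which some find easier to read; the behavior is identical, so this is just an alternative spelling:

df1 <- df[seq(1, nrow(df), by = 3), ]  # rows 1, 4, 7, ...
df2 <- df[seq(3, nrow(df), by = 3), ]  # rows 3, 6, 9, ...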
I need to create a data frame with a new column 'Variable1' in R; the requirements are below. I have one column named 'Country' and another column named 'City'. I need to check first whether the data is available in the City column; if there is no data there, then fall back to the Country column, based on the week. Example:
Country  Count
A        5
B        6
C        7

City  Count
A     3
B     5
When I create the new column, it should first check the count in the City column; if a count is not available there, it should move to Country and bring that count; and if it's not available in Country either, it should just carry the last value forward.
New Variable
A - 3
B - 5
C - 7
Can someone please assist with it? Is there any way to do it?
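A minimal sketch of that lookup-with-fallback logic, assuming two hypothetical data frames country and city shaped like the example above (the week-based part isn't fully specified, so the forward-fill is only hinted at):

country <- data.frame(key = c("A", "B", "C"), count = c(5, 6, 7))
city    <- data.frame(key = c("A", "B"), count = c(3, 5))

# take the City count where available, otherwise fall back to the Country count
new_var <- city$count[match(country$key, city$key)]
new_var[is.na(new_var)] <- country$count[is.na(new_var)]
# any values still missing could be carried forward, e.g. with zoo::na.locf(new_var)
data.frame(key = country$key, Variable1 = new_var)  # A 3, B 5, C 7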
Need to create usable dataframe using R or Excel
Variable1           ID      Variable2
Name of A person 1  002157  NULL
Drugs used          NULL    3.0
Days in hospital    NULL    2
Name of a surgeon   NULL    JOHN T.
Name of A person 2  002158  NULL
Drugs used          NULL    4.0
Days in hospital    NULL    5
Name of a surgeon   NULL    ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. The task is to analyze how many drugs were used and how many days each patient stayed in the hospital.
For that reason, I need to transform this data frame into a second one that is suitable for analysis (from horizontal to vertical). Basically, I have to create a data frame consisting of 4 columns: ID, Drugs used, Days in hospital, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first data frame and extract the filled rows;
for Name of a surgeon, Drugs used, and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second data frame.
In short, I have no idea how to do that. Could you help me write functions for R, or give tips for Excel?
For R, I guess you want something like this:
Load the table; make sure to substitute the "," with the separator used in your file (could be ";" or "\t" for tab, etc.). Passing na.strings="NULL" makes R read the literal NULL cells as proper NA values:
df = read.table("path/to/file", sep=",", header=TRUE, na.strings="NULL")
Create subset tables that contain only one row per patient:
id = subset(df, !is.na(ID))  # note: is.na(), not is.null() - is.null() is not vectorized
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
Make a new data frame that contains this information:
new_df = data.frame(
id = id$ID,
drugs = drugs$Variable2,
days = days$Variable2,
#...etc...no comma after the last!
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
#=====================================================
EDIT 2:
If you have an imperfect table, you might wanna do something like this:
Step 1.5): change all NA values in the ID column (which your table labels NULL; with na.strings="NULL" in the read.table() call above, R reads them as NA) to the patient ID from the row above. Note that the is.na() check in the code below works only for real NA values, and will not work with NULL or "NULL" or other stuff:
for(i in seq_along(df$ID)){
if(is.na(df$ID[i])) df$ID[i] <- df$ID[i-1]
}
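As a side note, the same forward-fill can be done without an explicit loop, assuming the zoo package is available:

df$ID <- zoo::na.locf(df$ID, na.rm = FALSE)  # carry the last non-NA ID forward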
Then go back to step 2) above (you don't need the id subset this time), and then you have to change each data frame a little. As an example, for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then instead of doing step 3 as above, use merge and choose the ID column as the merging column; all=TRUE keeps patients even when one of their lines is missing.
new_df = merge(drugs, days, by="ID", all=TRUE)
Repeat this for the other subsetted data frames:
new_df = merge(new_df, surgeon, by="ID", all=TRUE)
# etc...
That is much more robust: even if some patients lack a line that others have (e.g. days), their respective column in this new data frame will just contain an NA for that patient.
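To tie the steps together, here is an end-to-end sketch of the robust version (assuming the three-column layout shown in the question, with the literal string NULL marking empty cells):

# read the export; treat literal "NULL" cells as missing values
df <- read.table("path/to/file", sep = ",", header = TRUE,
                 na.strings = "NULL", stringsAsFactors = FALSE)

# carry each patient's ID down through their block of rows
for(i in seq_along(df$ID)){
  if(is.na(df$ID[i])) df$ID[i] <- df$ID[i-1]
}

# one small data frame per variable, renamed for the merge
drugs   <- df[df$Variable1 %in% "Drugs used", c("ID", "Variable2")]
days    <- df[df$Variable1 %in% "Days in hospital", c("ID", "Variable2")]
surgeon <- df[df$Variable1 %in% "Name of a surgeon", c("ID", "Variable2")]
colnames(drugs)   <- c("ID", "drugs")
colnames(days)    <- c("ID", "days")
colnames(surgeon) <- c("ID", "surgeon")

# merge on ID; all=TRUE keeps patients even when a line is missing
new_df <- merge(merge(drugs, days, by = "ID", all = TRUE),
                surgeon, by = "ID", all = TRUE)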
I have a data frame with 4 columns. To one of the columns I added a date, so that each value looks like this:
>print(result[[4]][[10000]])
[[10000]]
[1] "Jan" "14" "2012"
That means that in the 10000th field of the 4th column I have these 3 values. This is the only column that holds multiple values.
Now, the other 3 columns of the data frame result hold single values, not multiple ones. One of those columns, the first one, has the states of the United States as values. What I want to do is create a new data frame from columns 2 and 4 (the one described above) of the result data frame, but depending on the state.
So for example I want all the 2nd column and 4th column data of the state of Alabama. I tried this but I don't think it is working properly. "levels" is the 2nd column and "weeks" is the 4th column of the data frame result.
rst <- subset(result, result$states == 'Alabama', select = c(result$levels, result$weeks))
The problem here is that subset copies all the columns to rst, not just the second and fourth columns of the result data frame linked to the state of Alabama, which are the only ones I want. Any idea how to do this correctly?
Edit to add the code
I'm adding the code here since I think there must be something I'm not seeing. First, a small sample of the original data, which is in a csv file:
st URL WEBSITE al aln wk WEEKSEASON
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-04-2008 40 2008-09
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-11-2008 41 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-18-2008 42 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-25-2008 43 2008-09
And this is the code:
#Extracts relevant data from the csv file
extract_data<-function(){
#open the file. NAME SHOULD BE CHANGED
sd <- read.csv(file="sdr.csv", header=TRUE, sep=",")
#Extracts the data from the ACTIVITY LEVEL column. Notice that the name of the column was changed on the file
#to 'al' to make the reference easier
lv_list <- sd$al
#Gets only the number from each value getting rid of the word "Level"
lvs <- lapply(strsplit(as.character(lv_list), " "), function(x) x[2])
#Gets the ACTIVITY LEVEL NAME. Column name was changed to 'aln' on the file
lvn_list <- sd$aln
#Gets the state. Column name was changed to 'st' on the file
st_list <- sd$st
#Gets the week. Column name was changed to 'wk' on the file
wk_list <- sd$wk
#Divides the weeks data in month, day, year
wks <- strsplit(as.character(wk_list), "-")
result<-list("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wks)
return(result)
}
forecast<-function(){
result=extract_data()
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
return(0) #return results
}
You're nearly there, but you don't need to reference the dataframe in the select argument - this should work:
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
You could also look into the package dplyr, which gives you SQL-like abilities and is great for manipulating larger and more complicated data sets.
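For instance, the dplyr equivalent of that subset() call would look roughly like this (assuming result is a data frame with those columns):

library(dplyr)
rst <- result %>%
  filter(states == "Alabama") %>%  # keep only Alabama rows
  select(levels, weeks)            # keep only the two columns of interest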
EDIT
Thanks for posting your code - I think I've identified a few problems.
The result you return from extract_data() is a list, not a data.frame - which is why the code in forecast() doesn't work. If it did return a dataframe the original solution would work.
You're forming your list out of a combination of vectors and lists, which is a problem - a dataframe is (roughly) a list of vectors, not a collection of the two types. If you replace your list creation line with result <- data.frame(...) you run into problems because of this.
There are two problematic columns - lvs (or levels) and wks (weeks). Where you use lapply(), using sapply() instead would give you a vector, as required (see the manual). The second issue is the weeks column. What you're actually dealing with here is a list of character vectors of length 3. There's no easy way to do what you want - you can't, for example, have each 'cell' of a column in a dataframe contain a character vector, as the columns are themselves vectors.
My suggestions would be to either:
Use the original format "Oct-01-2008", i.e. construct your data.frame with wk_list rather than splitting each date into the three strings;
Convert the original format into a better time format with a package like lubridate (A+++++ would recommend, great package - see the sketch after this list);
Or finally, split the week column into three columns, so you'd have one for month, one for day and one for year. You could do this very simply from wk_list like this:
wks <- sapply(strsplit(as.character(wk_list), "-"), function(x) c(x[1], x[2], x[3]))
Month <- wks[1,]
Day <- wks[2,]
Year <- wks[3,]
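For the lubridate route (option 2), a minimal sketch - mdy() matches the month-day-year order of strings like "Oct-01-2008":

library(lubridate)
weeks <- mdy(as.character(wk_list))  # parses "Oct-01-2008" into a proper Date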
Once both lvs and wks are in vector form, you're good to just run
result<-data.frame("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wks)
and the script should work.
So I have a block of text that I've separated into a vector, and from each line of the vector I've further built a data frame. In a perfect world, every row of the DF would be exactly the same, but it's not, and there are a number of rows with NA values in them. What I need to do is select the row from the data frame with the least number of NA values.
So say the DF looked like this:
Name Year NA Address NA State NA
Name Year ID Address City State Rank
Name Year NA NA City State NA
Name NA NA NA NA NA Rank
Name Year NA NA NA NA NA
where each value belongs to a column. So I need a way to identify which row has the fewest NAs, and then select that row's elements. Ultimately I want the return to be a single-row DF (or preferably a vector) that reads:
Name Year ID Address City State Rank
In this case, row 2.
I know that:
max( rowSums(!is.na(x)) )
will return the highest count of non-NA values in any row (not the row number itself), and I can't seem to figure out how to grab the elements of that row. I was thinking using which() would work, but I can't seem to figure it out.
Thanks for your help!
David
If your data frame is df, then:
df[which.max(rowSums(!is.na(df))),]
Should return the single-row data frame with the fewest NAs.
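For instance, with a toy version of the example above (hypothetical values, NA for the blanks):

df <- data.frame(V1 = c("Name", "Name", "Name"),
                 V2 = c("Year", "Year", NA),
                 V3 = c(NA, "ID", NA))
df[which.max(rowSums(!is.na(df))), ]  # returns row 2, the most complete row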
I'm running into some trouble while attempting a spatial join between a shapefile and a data table in csv.
Here's what my data looks like:
Point Shapefile's attribute data (StudentID):
ID Address Long Lat
123.00 street long lat
456.00 street long lat
789.01 street long lat
223.00 street long lat
412.02 street long lat
Data Table (Table):
ID Name Age School
123.00 name age school
456.00 name age school
789.01 name age school
223.00 name age school
412.02 name age school
Important note: StudentID contains roughly 500 records, while the Table only has 250. Some records in StudentID will NOT be matched.
Problem 1:
I have an Excel file, which I converted to csv for importing into R. While running the join, I noticed that the format of some of my data changed in the ID column (so 123.00 would become 123; 456.00 became 456; 789.01 stayed the same). However, when I opened the csv file in Notepad, the formatting was correct. I tried reading the table as a .txt file, but no luck. Does anyone know why this happens, and what are some ways to overcome it?
Because I couldn't join the data based on an exact match, I decided to try a partial join, since the IDs are unique regardless of the last 2 digits, which led me to Problem 2...
Problem 2:
Here is what I used to join the two:
StudentID@data = data.frame(StudentID@data, Table[charmatch(StudentID@data$ID, Table$ID), ])
This joined the data, but also, as expected, returned rows with NAs. I used na.omit to remove the rows and the resulting data contained all the ones that matched. However, in the shapefile, ALL of my points are still there. Why did those dots remain when the records have been removed?
Problem 1:
Excel sometimes exports floating-point values using a comma , as the decimal separator. This can lead to problems in csv imports. Make sure that Excel uses points . for decimal separators, or specify the separators when importing, i.e. read.csv('file.csv', sep=';', dec=',') (or the shorthand read.csv2('file.csv'), which assumes exactly that combination).
Problem 2:
If you want to remove points with NA values from a shapefile, you need a logical vector selecting the rows you want to keep. Here is an example of how this could look (assuming your shapefile is named student_points):
student_points <- student_points[!is.na(student_points@data$age), ]
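As an aside, if you'd rather avoid charmatch() and the manual NA cleanup altogether, the sp package also provides a merge() method for Spatial objects; a sketch, assuming the object names from the question:

library(sp)
# left join on ID: every point is kept; unmatched points get NA attributes
StudentID <- merge(StudentID, Table, by = "ID", all.x = TRUE)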