How to get a .csv file into R?

I have this .csv file:
ID,GRADES,GPA,Teacher,State
3,"C",2,"Teacher3","MA"
1,"A",4,"Teacher1","California"
And what I want to do is read in the file using the R statistical software and read in the Header into some kind of list or array (I'm new to R and have been looking for how to do this, but so far have had no luck).
Here's some pseudocode of what I want to do:
inputfile=read.csv("C:/somedirectory")
for eachitem in row1:{
add eachitem to list
}
Then I want to be able to use those names to call on each vertical column so that I can perform calculations.
I've been scouring Google for an hour trying to find out how to do this, but there is not much out there on dealing with headers specifically.
Thanks for your help!

You mention that you will call on each vertical column so that you can perform calculations. I assume that you just want to examine each single variable. This can be done through the following.
df <- read.csv("myRandomFile.csv", header=TRUE)
df$ID
df$GRADES
df$GPA
It might be helpful to assign a column to its own variable.
var3 <- df$GPA
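To get the header itself as a character vector, names(df) is usually enough. A minimal sketch, assuming the sample data from the question is supplied inline through the text argument instead of being read from a file (csv_text, header and mean_gpa are names made up for this illustration):

```r
# Sample data from the question, supplied inline so the example is self-contained
csv_text <- 'ID,GRADES,GPA,Teacher,State
3,"C",2,"Teacher3","MA"
1,"A",4,"Teacher1","California"'

df <- read.csv(text = csv_text, header = TRUE)

# The header row becomes the column names of the data frame
header <- names(df)
print(header)

# Each name can then be used to pull out its column
for (col in header) {
  print(df[[col]])
}

# Calculations work directly on an extracted column
mean_gpa <- mean(df$GPA)
print(mean_gpa)
```

With a real file, the text argument would be replaced by the path to the .csv file.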

You need read.csv("C:/somedirectory/some/file.csv") and in general it doesn't hurt to actually look at the help page including its example section at the bottom.

As Dirk said, the function you are after is 'read.csv' or one of the other read.table variants. Given your sample data above, I think you will want to do something like this:
setwd("c:/random/directory")
df <- read.csv("myRandomFile.csv", header=TRUE)
All we did in the above was set the directory to where your .csv file is and then read the .csv into a dataframe named df. You can check that the data loaded properly by checking the structure of the object with:
str(df)
Assuming the data loaded properly, you can then go on to perform any number of statistical methods with the data in your data frame. I think summary(df) would be a good place to start. Learning how to use the help in R will be immensely useful, and a quick read through the help on CRAN will save you lots of time in the future: http://cran.r-project.org/

You can use
df <- read.csv("filename.csv", header=TRUE)
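# dosomething() is a placeholder for your own function
# To loop over each column (seq_len() is safe even when there are zero columns)
for (i in seq_len(ncol(df))) {
  dosomething(df[, i])
}
# To loop over each row
for (i in seq_len(nrow(df))) {
  dosomething(df[i, ])
}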
Also, you may want to have a look at the apply function (type ?apply or help(apply)) if you want to use the same function on each row or column.
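As a rough sketch of the apply family mentioned above, assuming a small numeric data frame made up for the illustration (col_means and row_sums are hypothetical names):

```r
# Small made-up data frame with numeric columns only
df <- data.frame(ID = c(3, 1), GPA = c(2, 4))

# sapply() calls the function once per column and returns a named vector
col_means <- sapply(df, mean)
print(col_means)

# apply() with MARGIN = 1 calls the function once per row
row_sums <- apply(df, 1, sum)
print(row_sums)
```

Note that apply() converts the data frame to a matrix first, so it is best suited to data frames whose columns all have the same type.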

Please check this out if it helps you
df<-read.csv("F:/test.csv",header=FALSE,nrows=1)
df
V1 V2 V3 V4 V5
1 ID GRADES GPA Teacher State
a<-c(df)
a[1]
$V1
[1] ID
Levels: ID
a[2]
$V2
[1] GRADES
Levels: GRADES
a[3]
$V3
[1] GPA
Levels: GPA
a[4]
$V4
[1] Teacher
Levels: Teacher
a[5]
$V5
[1] State
Levels: State
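The "Levels:" lines in the output above appear because the values were read as factors. If a plain character vector of header names is wanted, one possible variation is to read the first row as strings and flatten it. This is only a sketch, assuming the sample data from the question is supplied inline (csv_text, first_row and header are names chosen for the illustration):

```r
# Sample data from the question, supplied inline
csv_text <- 'ID,GRADES,GPA,Teacher,State
3,"C",2,"Teacher3","MA"
1,"A",4,"Teacher1","California"'

# Read only the first line, without treating it as a header, as strings
first_row <- read.csv(text = csv_text, header = FALSE, nrows = 1,
                      stringsAsFactors = FALSE)

# Flatten the one-row data frame into an unnamed character vector
header <- unlist(first_row, use.names = FALSE)
print(header)
```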

Since you say you want to access by position once your data is read in, you should know about R's subsetting/indexing functions.
The easiest is
df[row,column]
#example
df[1:5,] #rows 1:5, all columns
df[,5] #all rows, column 5.
Other methods are here. I personally use the dplyr package for intuitive data manipulation (not by position).
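A short sketch of a few base R subsetting forms, using a made-up data frame modelled on the question's columns (the variable names are hypothetical):

```r
# Made-up data frame modelled on the question's columns
df <- data.frame(ID = c(3, 1), GRADES = c("C", "A"), GPA = c(2, 4))

first_row    <- df[1, ]            # row 1, all columns (still a data frame)
third_col    <- df[, 3]            # all rows, column 3 (a plain vector)
gpa_by_name  <- df[, "GPA"]        # the same column, selected by name
high_gpa     <- df[df$GPA > 3, ]   # rows that satisfy a condition

print(first_row)
print(third_col)
print(high_gpa)
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if the column order changes.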

Related

Character vectors for subset conditions

I'm trying to subset a dataframe with character conditions from a vector!
This works:
temp <- USA[USA$RegionName == "Virginia",]
Now for a loop I created a column-vector containing all States by name, so I could filter through them:
> states
[1] "virginia" "Alaska" "Alabama" (...)
But if I now try to pass the "RegionName" condition via the vector, it does not work anymore:
temp <- USA[USA$RegionName == states[1],]
What I tried so far:
paste(states[1])
as.factor(states[1])
as.character(states[1])
To recreate the relevant data frame:
string <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv")
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
(In my vector Virginia is just on top for convenience!)
Based on the reproducible example, the first element of 'states' is empty
states[1]
[1] ""
We need to remove the blank elements
states <- states[nzchar(states)]
and then execute the code
dim(USA[USA$RegionName == states[1],])
[1] 569 51
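To illustrate what nzchar() does, here is a small sketch on a made-up vector (not the real data set); states_clean is a name chosen for the example:

```r
# Made-up vector with empty strings mixed in
states <- c("", "Virginia", "Alaska", "")

# nzchar() is TRUE for every non-empty string
keep <- nzchar(states)
print(keep)

# Keep only the non-empty entries
states_clean <- states[keep]
print(states_clean)
```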
I have not tried akrun's option, as the classic import function is very slow. When using a more modern approach, I noticed that the automatic type detection makes RegionName a logical, so every value gets converted to NA. Therefore, here is my approach to your problem:
# Much faster way to read the CSV; all columns are set to character because
# automatic type detection makes RegionName a logical, returning all NA
string <- readr::read_csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", col_types = readr::cols(.default = "c"))
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
temp <- USA[USA$RegionName == states[1],]
You will have to convert the columns according to your needs, or specify exactly which column should have which data type when importing.
I must admit that this is a poorly asked question. Actually I wanted to subset the dataframe based on two conditions, a date condition and a state condition.
The date condition worked fine. The state condition did not work in combination or alone, so I assumed that the error was here.
In fact, I did a very bizarre transformation of the date from the original source. After I implemented another, much more reliable transformation, the state condition also worked fine, with the code in the question asked.
The error apparently lay in a badly implemented date transformation! Sorry for the rookie mistake; my next question will be more sophisticated.
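For readers with the same goal, a minimal sketch of subsetting on a date condition and a state condition together. The data frame below is made up, and it is an assumption of this example (not confirmed in the thread) that the date is stored as a YYYYMMDD integer:

```r
# Made-up data; the YYYYMMDD integer format is an assumption for illustration
usa <- data.frame(
  RegionName = c("Virginia", "Virginia", "Alaska"),
  Date = c(20200301L, 20200302L, 20200301L)
)

# Convert to a proper Date via character, with an explicit format
usa$Date <- as.Date(as.character(usa$Date), format = "%Y%m%d")

# Combine both conditions with & inside the row index
temp <- usa[usa$RegionName == "Virginia" & usa$Date == as.Date("2020-03-02"), ]
print(temp)
```

Checking the result of the conversion with str() or anyNA() before filtering helps catch a faulty date transformation early.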

Removing S4 class column in R (flowCore)

I work with the package flowCore found in Bioconductor, which reads my data files in an S4 class format. You can type library(flowCore) and then data(GvHD) which loads an example dataset. When you type GvHD you can see that this dataset is made out of 35 experiments, which can be accessed individually by typing for example GvHD[[1]].
Now I am trying to delete two columns FSC-H and SSC-H from all the experiments, but I have been unsuccessful.
I have tried myDataSet<- within(GvHD, rm("FSC-H","SSC-H")) but it doesn't work. I would greatly appreciate any help.
rm isn't meant for removing columns. The normal procedure is to assign NULL to that column:
for (i in 1:35){
GvHD[[i]][,c("FSC-H","SSC-H")] <- NULL
}
This is the same as you would do for a data frame.
I posted my question on the relevant GitHub page for flowCore and the answer was provided by Jacob Wagner.
GvHD[[1]] is a flowFrame, not a simple data frame, which is why the NULL assignment doesn't work. The underlying representation is also a matrix, which also doesn't support dropping a column by assigning it NULL.
If you want to drop columns, here are some ways you could do that. Note for all of these I'm subsetting columns for the whole flowSet rather than looping through each flowFrame. But you could perform these operations on each flowFrame as well.
As Greg mentioned, choose the columns you want to keep:
data(GvHD)
all_cols <- colnames(GvHD)
keep_cols <- all_cols[!(all_cols %in% c("FSC-H", "SSC-H"))]
GvHD[,keep_cols]
Or you could just filter in the subset:
GvHD[,!colnames(GvHD) %in% c("FSC-H", "SSC-H")]
You could also grab the numerical indices you want to drop and then use negative subsetting.
# drop_idx <- c(1,2)
drop_idx <- which(colnames(GvHD) %in% c("FSC-H", "SSC-H"))
GvHD[,-drop_idx]
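The point about matrices can be seen without flowCore. The sketch below uses a plain base R matrix with made-up values and column names borrowed from the question; it says nothing about flowCore's own behaviour beyond what is quoted above:

```r
# Plain matrix with made-up values; column names borrowed from the question
m <- matrix(1:6, nrow = 2,
            dimnames = list(NULL, c("FSC-H", "SSC-H", "FL1-H")))

# Assigning NULL to a matrix column raises an error instead of dropping it
outcome <- tryCatch({
  m[, "FSC-H"] <- NULL
  "dropped"
}, error = function(e) "error")
print(outcome)

# Selecting the columns to keep works; drop = FALSE keeps the result a matrix
kept <- m[, !colnames(m) %in% c("FSC-H", "SSC-H"), drop = FALSE]
print(colnames(kept))
```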

How to reference variables from a list when looping over variables using "for"

I am a beginner at R coming from Stata, and my first headache is figuring out how to loop over a list of names, applying the same operation to each name. The names are variables from a data frame. I tried defining a list in this way: mylist<- c("df$name1", "df$name2") and then I tried: for (i in mylist) { i } which I hoped would be equivalent to writing df$name1 and then df$name2, to make R print the content of the variables named name1 and name2 from the data frame df. I tried other commands, like deleting a variable with i=NULL inside the for loop, but that didn't work either. I would greatly appreciate it if someone could tell me what I am doing wrong. I wonder if it has something to do with the way I write the i; maybe R does not interpret it to mean the elements of my character vector.
For more clarification, I will write out the code I would use in Stata in this instance. Instead of asking Stata to print the content of a variable, I am asking it to give summary statistics of a variable, i.e. the number of observations, mean, standard deviation, min and max, using the summarize command. In Stata I don't need to refer to the data frame, as I usually have only one dataset in memory, and I need only write:
foreach i in name1 name2 { #name1 and name2 being the names of the variables
summarize `i'
}
So far, I haven't managed to do the same thing using the for function in R, which I naively thought would be:
mylist<-c("df$name1", "df$name2")
for (i in mylist) {
summary(i)
}
You probably just need to print the name to see it. For example, if we have a data frame like this:
df <- data.frame("A" = "a", "B" = "b", "C" = "c")
df
# > A B C
# > 1 a b c
names(df)
# "A" "B" "C"
We can operate on the names using a for loop on the names(df) vector (no need to define a special list).
for (name in names(df)){
print(name)
# your code here
}
R is a little more reticent to let you use strings/locals as code than Stata is. You can do it with functions like eval but in general that's not the ideal way to do it.
In the case of variable names, though, you're in luck, as you can use a string to pull out a variable from a data.frame with [[]]. For example:
df <- data.frame(a = 1:10,
b = 11:20,
c = 21:30)
for (i in c('a','b')) {
print(i)
print(summary(df[[i]]))
}
Notes:
if you want an object printed from inside a for loop you need to use print().
I'm assuming that you're using the summary() function just as an example and so need the loop. But if you really just want a summary of each variable, summary(df) will do them all, or summary(df[,c('a','b')]) to just do a and b. Or check out the stargazer() function in the stargazer package, which has defaults that will feel pretty comfortable for a Stata user.
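As a possible alternative to the explicit loop, the same selection by name can be combined with sapply() or lapply(). A sketch using the data frame from the answer above (vars, means and summaries are names chosen for the illustration):

```r
df <- data.frame(a = 1:10,
                 b = 11:20,
                 c = 21:30)

# Variable names held as strings, as in the Stata foreach loop
vars <- c("a", "b")

# df[vars] selects those columns; sapply() returns one value per column
means <- sapply(df[vars], mean)
print(means)

# lapply() keeps the full summary of each variable in a named list
summaries <- lapply(df[vars], summary)
print(summaries$a)
```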

Use lists/dataframes as items in for-loops in R

I am quite sure this is basic stuff, but I just can't find the answer by googling. So my problem:
I want to use a for-loop on a list of lists or data frames. But when you use list[i], you get all the values in the data frame instead of the data frame itself. Can anyone point out to me how to code this properly?
Example of the code:
a<-data.frame(seq(1:3),seq(3:1))
b<-data.frame(seq(1:3),seq(3:1))
l<-c(a,b)
Then l[1] returns:
> l[1]
$seq.1.3..
[1] 1 2 3
And I want it to just return: a
You can use the list function:
a<-data.frame(1:3,1:3)
b<-data.frame(3:1,3:1)
l<-list(a,b)
And access its value with double brackets [[:
l[[1]]
l[[2]]
PS: seq(1:3) and seq(3:1) output the same value, so I used 1:3 and 3:1. :)
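To make the difference concrete, here is a small sketch comparing c() with list(), and single with double brackets. The column names x and y and the variables flat and nested are made up for the example:

```r
a <- data.frame(x = 1:3, y = 1:3)
b <- data.frame(x = 3:1, y = 3:1)

# c() flattens the data frames into one list of their columns
flat <- c(a, b)
print(length(flat))

# list() keeps each data frame as a single element
nested <- list(a = a, b = b)
print(length(nested))

# Single brackets return a list; double brackets return the element itself
print(class(nested[1]))
print(class(nested[[1]]))

# Looping over the list yields each data frame in turn
for (d in nested) {
  print(nrow(d))
}
```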

How to use List of List of Dataframes

I'm not sure if this is possible, or even how to find a good solution to the following R problem.
Data / Background / Structure:
I've collected a big dataset of project-based cooperation data, which maps specific projects to the participating companies (this can be understood as a bipartite edge list for social network analysis). For analytical reasons it is advisable to split the whole dataset into subsets by location and time period. Therefore, I've created the following data structure:
sna.location.list
  [[1]] (location1)
    [[1]] (is a dataframe containing the bip. edge-list for time-period1)
    [[2]] (is a dataframe containing the bip. edge-list for time-period2)
    ...
    [[20]] (is a dataframe containing the bip. edge-list for time-period20)
  [[2]] (location2)
    ... (same as 1)
  ...
  [[32]] (location32)
    ...
Every dataframe contains a project id and the corresponding company ids.
My goal is now to transform the bipartite edgelists to one-mode networks and then do some further sna-related-calculations (degree, centralization, status, community detection etc.) and save them.
I know how to do these calculation steps with one(!) specific network, but I am having a really hard time automating this process for all of the networks at once in the described list structure, and saving the various outputs (node-level and network-level variables) in a similar structure.
I already tried to look up several ways of for-loops and apply approaches but it still gives me sleepless nights how to do this and right now I feel very helpless. Any help or suggestions would be highly appreciated. If you need more information or examples to give me a brief demo or code example how to tackle such a nested structure and do such sna-related calculations/modification for all of the aforementioned subsets in an efficient automatic way, please feel free to contact me.
Let's say you have a function foo that you want to apply to each data frame. Those data frames are in lists, so lapply(that_list, foo) is what we want. But you've got a bunch of lists, so we actually want to lapply that first lapply across the outer list, hence lapply(that_list, lapply, foo). (The foo will be passed along to the inner lapply through the ... argument.) If you wish to be more explicit, you can use an anonymous function instead: lapply(that_list, function(x) lapply(x, foo)).
You haven't given a reproducible example, so I'll demonstrate applying the nrow function to a list of built-in data frames:
d = list(
list(mtcars, iris),
list(airquality, faithful)
)
result = lapply(d, lapply, nrow)
result
# [[1]]
# [[1]][[1]]
# [1] 32
#
# [[1]][[2]]
# [1] 150
#
#
# [[2]]
# [[2]][[1]]
# [1] 153
#
# [[2]][[2]]
# [1] 272
As you can see, the output is a list with the same structure. If you need the names, you can switch to sapply with simplify = FALSE.
This covers applying functions to a nested list and saving the returns in a similar data structure. If you need help with calculation efficiency, parallelization, etc., I'd suggest asking a separate question focused on that, with a reproducible example.
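As a further sketch of the same idea, the nested list can carry names and the applied function can return several values per data frame. The names location1, period1 and so on, and the helper describe(), are hypothetical stand-ins for the real locations, time periods and network calculations:

```r
# Named nested list of built-in data frames, standing in for the real data
d <- list(
  location1 = list(period1 = mtcars, period2 = iris),
  location2 = list(period1 = airquality, period2 = faithful)
)

# Hypothetical per-data-frame calculation returning several values
describe <- function(df) {
  c(rows = nrow(df), cols = ncol(df))
}

# The explicit anonymous-function form of the nested lapply
result <- lapply(d, function(periods) lapply(periods, describe))

# The result mirrors the input structure, so names can be used to navigate it
print(result$location1$period2)
print(result$location2$period1)
```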

Resources